Recent Releases of https://github.com/rapidsai/cudf
https://github.com/rapidsai/cudf - v25.08.00
🚨 Breaking Changes
- Allow `np.dtype('object')` for cases that are valid (#19478) @galipremsagar
- [FEA] Remove CUDA JIT-Compatibility Checks & CCCL WARs (#19470) @lamarrr
- Drop cuda 11 usages (#19386) @galipremsagar
- Deprecate cudf::round for float types (#19298) @davidwendt
- Support output_dtype in cudf::reduce for nunique aggregation (#19265) @davidwendt
- Change default cudf-polars executor to "streaming" (#19263) @TomAugspurger
- Fix Handling of Complex Types in AST (#19248) @lamarrr
- Enable chunked reading of PQ sources with `>2B` rows (#19245) @mhaseeb123
- Refactor `grid_1d` class (#19211) @lamarrr
- Return valid for all-nulls in reduce() with nunique include-nulls aggregation (#19196) @davidwendt
- Refactor JNI error handling (#19149) @ttnghia
- Remove CUDA 11 from dependencies.yaml (#19139) @KyleFromNVIDIA
- Quick fixes of `modernize-use-constraints` rule (#19105) @vuule
- Filter Parquet row groups using row bounds (#19082) @mhaseeb123
- Temporarily revert "Refactor JNI error handling (#18983)" (#19076) @abellina
- Rename `parquet_chunked_writer` to `chunked_parquet_writer` for consistency with the reader (#19047) @mhaseeb123
- Compile libcudf using C++20 Standard (#19045) @vuule
- Refactor JNI error handling (#18983) @ttnghia
- stop uploading packages to downloads.rapids.ai (#18973) @jameslamb
- Remove deprecated Series methods, isclose (#18947) @mroeschke
- Remove deprecated groupby.collect (#18946) @mroeschke
- Remove deprecated get_dummies(cats=, ...) (#18944) @mroeschke
- Add pylibcudf.Column.from_arrow factory method (#18937) @Matt711
- Add pylibcudf.Table.from_arrow factory method (#18936) @Matt711
- Remove deprecated APIs (#18933) @vuule
- Remove cudf.Scalar (#18927) @mroeschke
- Remove deprecated `cudf::io::host_buffer` (#18881) @Matt711
- Null-handling for Transforms (#18845) @lamarrr
- Enable `skip_rows` in the chunked parquet reader. (#18130) @mhaseeb123
🐛 Bug Fixes
- Increase alignment requirement for parquet bloom filter to 256 (#19595) @mhaseeb123
- Revert "Add primitive row dispatch support for semi/anti join and cudf::contains" (#19503) @PointKernel
- Allow `np.dtype('object')` for cases that are valid (#19478) @galipremsagar
- Add conda dependency on nvidia-ml-py. (#19454) @bdice
- Mark `cudf.pandas` notebook repr test as flaky (#19441) @Matt711
- Fix pytest to properly expose a bug (#19433) @galipremsagar
- Switch from `thrust::sort` to `cub::DeviceRadixSort` in Parquet chunked reader (#19414) @ttnghia
- Use numba-cuda>=0.15.2,<0.16 (#19413) @bdice
- Update String Transform Examples (#19407) @lamarrr
- [BUG] Make floor division and modulo by 0 match CPU polars (#19406) @Matt711
- Handle empty input in cudf::strings::extract APIs (#19398) @davidwendt
- Fix jitify error on exit from FILTER_TEST (#19395) @davidwendt
- Update cudf.pandas tests to silence deprecation warnings (#19377) @Matt711
- Replace sprintf with snprintf in libcudf parquet tests (#19371) @davidwendt
- Make DateOffset respect timezone (#19366) @Matt711
- Fix flaky tests in `cudf.pandas` (#19345) @TomAugspurger
- Update protocol choices for ucxx in PDSH benchmark (#19343) @TomAugspurger
- Remove passing pandas tests from xfail list (#19341) @Matt711
- Fix Union-Slice bug (#19336) @Matt711
- Fix bit shift overflow in segmented_offset_bitmask_binop utility (#19329) @davidwendt
- Fix job filters for `pandas-tests` (#19322) @galipremsagar
- Fix compile warning in interop_stringview.cpp (#19320) @davidwendt
- Fix a use-after-free issue in TDigest aggregation code. (#19311) @nvdbaranec
- Always represent datetime aware data as UTC in strftime (#19304) @mroeschke
- Do not pass cupy objects to numba kernels directly (#19283) @brandon-b-miller
- Correct docstring for `DataFrame.apply` to match code (#19262) @dagardner-nv
- Cast `n_unique` aggregation result to match polars (#19256) @Matt711
- Fix Handling of Complex Types in AST (#19248) @lamarrr
- Add missing include (#19239) @vyasr
- Raised `MixedTypeErrors` for conditions that lead to mixed types (#19232) @galipremsagar
- Fix errors in the nvCOMP adapter (#19221) @vuule
- Remove nvToolsExt usage (#19209) @vyasr
- Fix a pair of bugs in get_decompression_scratch() size. (#19207) @nvdbaranec
- Allow `is_list_like` to return correct values by disabling it (#19188) @galipremsagar
- Fix slicing after `Join` and `GroupBy` in streaming cudf-polars (#19187) @rjzamora
- Fix `binops` type preservation for some dtypes (#19183) @galipremsagar
- Fix streaming `GroupBy` on non-trivial keys (#19181) @rjzamora
- Fix bitmask in from_arrow_host for sliced stringview type (#19174) @davidwendt
- Fixed group_by mean with missing values and multiple partitions (#19165) @TomAugspurger
- Add fallback to `HStack` lowering in cudf-polars (#19163) @rjzamora
- Fix `Literal` partitioning in cudf-polars (#19160) @rjzamora
- Fix `from_array_interface` for empty arrays (#19144) @Matt711
- Adding GH_TOKEN pass-through to summarize job (#19143) @msarahan
- Fix hash collision in Union([MapFunction]) (#19124) @TomAugspurger
- Fix bug in `group_by().n_unique()` in streaming cudf-polars (#19108) @rjzamora
- Parse (non-MultiIndex) label-based keys to structured data (#19103) @mroeschke
- Fix cudf_polars spilling (#19101) @TomAugspurger
- Fix libcudf strings case logic to set null-row size to zero (#19095) @davidwendt
- Temporarily revert "Refactor JNI error handling (#18983)" (#19076) @abellina
- Temporary workaround for incorrect `SplitScan` results in cuDF-Polars (#19071) @rjzamora
- Use default memory resource for JSON_QUOTE_NORMALIZATION gtests (#19057) @davidwendt
- Added null-probability to polynomial benchmarks and fixed transform call-sites (#18972) @lamarrr
- Fix flaky custreamz test (#18961) @TomAugspurger
- Fix tdigest percentile correctness for low row-counts (#18952) @mythrocks
- Enable `skip_rows` in the chunked parquet reader. (#18130) @mhaseeb123
📖 Documentation
- Update conda environment file for CUDA 12.9 compatibility (#19376) @a-hirota
- Update recommended gcc version in contributing guide (#19365) @davidwendt
- Autodoc DateOffset (#19297) @wence-
- Fix cudf::column_device_view::element() doxygen (#19296) @davidwendt
- Document aggregations for cudf::reduce in doxygen (#19264) @davidwendt
- add docs on CI workflow inputs (#19234) @jameslamb
- Update README and CONTRIBUTING to reflect new CUDA requirements (#19138) @PointKernel
- Remove the extra index URL for CUDA 12 (#19128) @vyasr
- Improve WordPieceVocabulary.tokenize documentation (#19098) @davidwendt
- Add some basic streaming engine documentation (#19088) @wence-
- Update the contributing guide to include pylibcudf in the build command (#19011) @Matt711
- Fix pylibcudf docs for some strings APIs (#19004) @davidwendt
- Update cuDF Python library design with BaseIndex and pylibcudf updates (#18903) @mroeschke
🚀 New Features
- Avoid using UVM on systems without a traditional memory resource (#19444) @Matt711
- Add parquet-sampling configuration options (#19423) @rjzamora
- Add new JSON reader interface accepting string column input to pylibcudf (#19400) @shrshi
- Add a parquet reader utility to update output null masks (#19370) @mhaseeb123
- Build and ship `shim.cu` file as LTOIR (#19368) @brandon-b-miller
- Add cudf::strings::find_instance API (#19326) @davidwendt
- Add single-file streaming `Sink` support (#19317) @rjzamora
- Support null_count expression (#19314) @Matt711
- Materialize tables in the experimental Parquet reader (#19308) @mhaseeb123
- Add new cudf::top_k API (#19303) @davidwendt
- Add cudf::strings::split_part API (#19289) @davidwendt
- Support output_dtype in cudf::reduce for nunique aggregation (#19265) @davidwendt
- Add `post_traversal` API to cudf-polars (#19258) @rjzamora
- Deprecate `DataFrame.apply_rows` (#19218) @brandon-b-miller
- Require `numba-cuda>=0.16.0` (#19213) @brandon-b-miller
- Add a mode to co-process decompression and compression on host and device (#19203) @vuule
- Return valid for all-nulls in reduce() with nunique include-nulls aggregation (#19196) @davidwendt
- Refactor JNI error handling (#19149) @ttnghia
- Add support for horizontal string concatenation `pl.concat_str` (#19142) @Matt711
- Add PDS-DS Query 1 (#19131) @Matt711
- Support `cudf-polars` `str.reverse` (#19117) @brandon-b-miller
- Support `cudf-polars` `str.pad_end` and `str.pad_start` (#19116) @brandon-b-miller
- Support `cudf-polars` `str.head` and `str.tail` (#19115) @brandon-b-miller
- Support `cudf-polars` `str.to_titlecase` (#19114) @brandon-b-miller
- Add `cudf/io/codec.hpp` to expose compression/decompression APIs (#19113) @ttnghia
- Support converting decimals to/from pylibcudf scalars (#19106) @Matt711
- Support resource-constrained sort-merge inner join operation through left table partitioning (#19102) @shrshi
- Filter Parquet row groups using row bounds (#19082) @mhaseeb123
- Implement UDF Filters (#19070) @lamarrr
- Move the remaining libcudf pieces to C++20 (#19065) @vuule
- Allow using a stream per thread at runtime (#19051) @vyasr
- Remove stacktrace retrieval code (#19048) @ttnghia
- Compile libcudf using C++20 Standard (#19045) @vuule
- String Transform Examples: Added Branching, Public API Versions, and Sampling (#19038) @lamarrr
- Refactor JNI error handling (#18983) @ttnghia
- Add basic `Sink` support for streaming cudf-polars executor (#18963) @rjzamora
- Fix debug-build Failure in JIT Tests (#18939) @lamarrr
- Add from_arrow factory methods for Scalar and DataType (#18938) @Matt711
- Add pylibcudf.Column.from_arrow factory method (#18937) @Matt711
- Add pylibcudf.Table.from_arrow factory method (#18936) @Matt711
- Update nvCOMP adapter (#18931) @vuule
- Create a pylibcudf Column from an iterable of python strings (#18916) @Matt711
- Add CLI argument to enable OOM protection in PDS-H (#18914) @pentschev
- Implement data page pruning using Parquet page index stats (#18873) @mhaseeb123
- Null-handling for Transforms (#18845) @lamarrr
- Implement row group pruning with dictionaries in experimental PQ reader (#18836) @mhaseeb123
- Add support for parquet scan + count operation (#18463) @Matt711
- Manage strings with NRT (#18453) @brandon-b-miller
🛠️ Improvements
- Disable codecov comments (#19472) @bdice
- [FEA] Remove CUDA JIT-Compatibility Checks & CCCL WARs (#19470) @lamarrr
- Use libnvcomp conda package (#19439) @bdice
- JNI Set RMM_LOG_LEVEL and RMM_LOG_ACTIVE_LEVEL to allow setting log level at compile time (#19435) @abellina
- Use numba-cuda >=0.14.0,<0.15.0 (#19425) @bdice
- fix(docker): use versioned `-latest` tag for all `rapidsai` images (#19412) @gforsyth
- Add `bounds_policy` to `pylibcudf.lists.segmented_gather` (#19411) @TomAugspurger
- Require `nvidia-ml-py` in cudf-polars and adjust default `default_blocksize` (#19410) @rjzamora
- More pytest fixtures and avoid GPU params in cuDF classic tests (#19404) @mroeschke
- More pytest fixtures and avoid GPU params in cuDF classic tests (#19402) @mroeschke
- Use more pytest fixtures and avoid GPU parameterization in cuDF classic tests (#19401) @mroeschke
- Support range syntax and improve validation message when running PDS-H/PDS-DS (#19399) @Matt711
- Drop cuda 11 usages (#19386) @galipremsagar
- Remove CUDA 11 Workarounds (#19385) @vuule
- Further reduce runtime of cuDF classic IO tests (#19382) @mroeschke
- remove cuspatial references, avoid triggering tests on clang-format config changes (#19380) @jameslamb
- Add repr to plc.aggregation.Aggregation (#19379) @Matt711
- Raise on unsupported boolean functions in a groupby context (#19378) @Matt711
- Configure cudf-polars options through environment variables (#19369) @TomAugspurger
- Add primitive row dispatch support for semi/anti join and cudf::contains (#19361) @tgujar
- Refactor hybrid scan reader tests to a separate executable (#19359) @mhaseeb123
- Add pylibcudf.Column.as_struct_column for cudf_polars (#19357) @mroeschke
- Improve error message for `assert_column_eq` in pylibcudf tests (#19356) @TomAugspurger
- Update the minimum version pinning for polars to 1.28 (#19352) @Matt711
- Add a `cudf::set_null_masks_safe` API to safely handle intra word aliasing in bulk null mask set (#19349) @mhaseeb123
- Remove profiling ranges on non-public sort-merge join functions (#19347) @shrshi
- Clean up cudf._lib.strings_udf.pyx (#19335) @mroeschke
- Add support for `pandas-2.3.1` (#19334) @galipremsagar
- Allow comparison binop to datetime.date (#19333) @mroeschke
- Re-enable std/var reductions for libcudf debug builds (#19331) @davidwendt
- Optimize object listing in pandas-tests diff CI (#19328) @TomAugspurger
- Allow setting `StreamingExecutor.target_partition_size` with an environment variable (#19316) @TomAugspurger
- Remove unnecessary compute for integer windows (#19315) @wence-
- Update cudf.pandas test skips for pandas==2.3.1 (#19313) @TomAugspurger
- Support Expr.str.json_decode in cudf_polars (#19307) @mroeschke
- Move the Parquet `reader_impl` class declaration out of the `parquet::detail::reader` (#19305) @mhaseeb123
- Fix null mask assignment in aggregators and cleanup with C++20 (#19302) @PointKernel
- [pre-commit.ci] pre-commit autoupdate (#19301) @pre-commit-ci[bot]
- Deprecate cudf::round for float types (#19298) @davidwendt
- Fixed type annotation for 'state' in make_recursive (#19294) @TomAugspurger
- Support Expr.str.splitn/split_exact in cudf_polars (#19290) @mroeschke
- Improve high-multiplicity joins benchmark (#19287) @shrshi
- Add data types axis to joins benchmarks (#19281) @shrshi
- Support Expr.str.strip_prefix/suffix in cudf_polars (#19278) @mroeschke
- Support Expr.str.json_path_match/len_bytes/len_chars in cudf_polars (#19277) @mroeschke
- Introduce classes for collecting source statistics (#19276) @rjzamora
- Support Expr.str.find & Expr.str.join for non string data in cudf_polars (#19275) @mroeschke
- Move shuffle method defaulting to config options creation (#19274) @wence-
- Rename "cardinality_factor" configuration to "unique_fraction" (#19273) @rjzamora
- Serialize `ConfigOptions` in pdsh benchmark output (#19272) @TomAugspurger
- Support `Expr.str.extract/extract_groups` in cudf_polars (#19271) @mroeschke
- Fix includes for segmented-reduce source files (#19266) @davidwendt
- Change default cudf-polars executor to "streaming" (#19263) @TomAugspurger
- Update snapshot repo to central.sonatype.com (#19259) @pxLi
- Raise `NotImplementedError` for `LazyFrame.profile` with the streaming executor (#19257) @TomAugspurger
- Move ast expression function definitions to .cpp files (#19250) @davidwendt
- Enable chunked reading of PQ sources with `>2B` rows (#19245) @mhaseeb123
- Support `str.count_matches` and `str.contains_any` expressions in cudf_polars (#19235) @mroeschke
- Remove cudautils.py (#19233) @mroeschke
- Use CUDA 12.9 in Conda, Devcontainers, Spark, GHA, etc. (#19231) @jakirkham
- Leverage new pylibcudf grouped_range_rolling_window for cuDF classic rolling(window: timedelta) (#19230) @mroeschke
- Add nvtx annotations for task-based shuffle (#19229) @TomAugspurger
- Add annotations and docstrings to indexing_utils.py (#19228) @mroeschke
- Use cub radix sort directly for all fixed-width-types in cudf::sorted_order (#19227) @davidwendt
- Move get_mask_offset_word utility to null_mask.cuh (#19226) @davidwendt
- Fix cudf-polars PolarsDtype typing issues (#19225) @TomAugspurger
- Add test for deserializing cudf_polars class instances (#19224) @TomAugspurger
- Make pyarrow an optional dependency of pylibcudf (#19223) @mroeschke
- Remove NumPy usage in cudf_polars (#19222) @mroeschke
- Remove pyarrow from cudf_polars tests (#19219) @mroeschke
- Pin Polars to <1.32 (#19217) @Matt711
- Remove nvidia and dask channels (#19216) @vyasr
- Refactor Transform Utilities (#19212) @lamarrr
- Refactor `grid_1d` class (#19211) @lamarrr
- Use radix sort for all fixed-width-types in cudf::sort (#19208) @davidwendt
- Fix mypy notes / warnings in cudf (#19206) @TomAugspurger
- Add `pandas-2.3.0` support (#19202) @galipremsagar
- Avoid `pylibcudf.interop.to_arrow` in `DataFrame.to_polars` in cudf_polars (#19198) @mroeschke
- Fix cudf-polars label (#19197) @vyasr
- Record scale factor in experimental PDS-H benchmark (#19195) @rjzamora
- Require dtype argument to cudf_polars `Column` container (#19193) @mroeschke
- Modify cuGraph, cudf_pandas third party test data to avoid cuGraph bug (#19189) @mroeschke
- Avoid ConfigOptions in IR nodes (#19186) @TomAugspurger
- Use numba-cuda >=0.14.0,<0.15.0 to get pynvjitlink by default. (#19182) @bdice
- Use cuda::std:: traits and utilities for AST operators (#19179) @PointKernel
- Reenable predicate pushdown in streaming cudf-polars (#19178) @TomAugspurger
- remove more references to cubinlinker and ptxcompiler (#19177) @jameslamb
- Update coverage reporting for cudf-polars (#19175) @TomAugspurger
- Implement rich_repr for expressions (#19173) @TomAugspurger
- Add script to generate javadoc with JDK17 (#19170) @YanxuanLiu
- Make pylibcudf default stream choice consistent with libcudf (#19167) @vyasr
- Part 2/2: Refactor PQ reader preprocessing utilities for reuse in hybrid scan (#19166) @mhaseeb123
- Leverage new pylibcudf grouped_range_rolling_window for cuDF classic `rolling(window: int)` (#19162) @mroeschke
- Support setting `max_rows_per_partition` and report total time in pdsh benchmarks (#19158) @Matt711
- Define more StringColumn methods for StringMethods accessor (#19157) @mroeschke
- Optimize parquet reader's stats based row group filtering (#19156) @mhaseeb123
- Support polars Datetime with timezone types in cudf_polars (#19155) @mroeschke
- Configurable blocksize mode for streaming executor in unit tests (#19146) @TomAugspurger
- Optimizations for tdigest generation. (#19140) @nvdbaranec
- Remove CUDA 11 from dependencies.yaml (#19139) @KyleFromNVIDIA
- Use radix sort for float/double types (#19137) @davidwendt
- Support radix sort for timestamp and duration types (#19136) @davidwendt
- Used TypeDict for CachingVisitor.state (#19135) @TomAugspurger
- Move Accessor implementation to their own directory (#19134) @mroeschke
- Add benchmarks for sorting float and timestamp (#19133) @davidwendt
- Enable using page mask in `decompress_page_data` in Parquet reader (#19132) @mhaseeb123
- refactor(shellcheck): fix all shellcheck warnings/errors (#19129) @gforsyth
- Remove pytest pin (#19127) @vyasr
- Move pdsh utility functions/classes to a separate module (#19126) @Matt711
- Use pylibcudf.Column.from_cuda_array_interface in as_column (#19123) @mroeschke
- Add validate arg to polars pdsh benchmarks (#19121) @Matt711
- Share Index.values with base implementation (#19112) @mroeschke
- Use len instead of len(obj.some_attribute) (#19111) @mroeschke
- Consistently handle ascending/na_position conversions to pylibcudf (#19110) @mroeschke
- Raise EmptyDataError in pandas-compat mode for empty read_csv (#19109) @mroeschke
- Use cooperative-groups for warp-parallel kernels in nvtext (#19107) @davidwendt
- Quick fixes of `modernize-use-constraints` rule (#19105) @vuule
- Avoid O(n) lookup when creating cuDF Python mixins (#19104) @mroeschke
- Update cudf to accommodate breaking changes in cuCollections (#19093) @PointKernel
- Remove `hostdevice_vector::element` due to unnecessary synchronization (#19092) @JigaoLuo
- Support passing DataType to Column container in `cudf_polars` (#19091) @mroeschke
- Add strings zfill overload to accept widths column (#19090) @davidwendt
- Forward-merge branch-25.06 to branch-25.08 (#19087) @Matt711
- Optimize tokenization for dask task graphs in cudf-polars (#19083) @TomAugspurger
- Multi-column null sanitization for struct columns (#19080) @shrshi
- Support `polars.Expr.value_counts` in `cudf_polars` (#19079) @mroeschke
- Support `polars.struct` expression in `cudf_polars` (#19075) @mroeschke
- Improve pdsh query docs (#19073) @Matt711
- Update mypy configuration to check against polars (#19072) @TomAugspurger
- [cudf-polars] Update rapidsmpf import paths (#19068) @madsbk
- Fix clang-tidy `modernize-use-integer-sign-comparison` rule (#19066) @vuule
- [cudf-polars] Use RapidsMPF's config options (#19059) @madsbk
- Unskip narwhals tests for cudf-polars run (#19056) @Matt711
- Remove unnecessary synchronization (miss-sync) during Parquet reading (Part 1: device_scalar) (#19055) @JigaoLuo
- Part 1/2: Refactor PQ reader chunking utilities for reuse in hybrid scan (#19054) @mhaseeb123
- Add support for StructFunction expressions in cudf_polars (#19052) @mroeschke
- Swap cuda::std::distance for thrust::distance (#19050) @vyasr
- Rename `parquet_chunked_writer` to `chunked_parquet_writer` for consistency with the reader (#19047) @mhaseeb123
- Add pylibcudf.Scalar.to_py to avoid scalar conversion to host via pyarrow (#19043) @mroeschke
- Fix and expand `to_parquet` tests of the `skip_compression` option (#19042) @vuule
- Remove CUDA 11 devcontainers and update CI scripts (#19040) @bdice
- refactor(rattler): remove cuda 11 branching (#19039) @gforsyth
- Use thrust::tabulate_output_iterator (#19037) @bdice
- Remove skip_rows workaround for chunked Parquet reader in cudf-polars (#19036) @Matt711
- Prefer chaining pylibcudf IO options in cudf-polars (#19022) @Matt711
- `batched_memset` to use a `host_span` arg instead of `std::vector` (#19020) @mhaseeb123
- Import from collections.abc for consistent typing/runtime access (#19019) @mroeschke
- Avoid using cudf module for type annotations (#19018) @mroeschke
- Mark pandas unit test test_eval_no_support_column_name as xpassing (#19016) @mroeschke
- Improving Parquet decode throughput for struct type columns (#19014) @shrshi
- Unify Frame._split and DataFrame.scatter_by_map/partition_by_hash implementations (#19013) @mroeschke
- Move IndexedFrame.memory_usage docstrings to DataFrame/Series, make RangeIndex methods consistent with base class (#19010) @mroeschke
- Share DataFrame/Series.(de)serialize methods, implement to_dlpack directly on Frame (#19008) @mroeschke
- Pin narwhals to 1.41 (#19007) @Matt711
- Add year range check to cudf::strings::is_timestamp (#19006) @davidwendt
- Add cudf::strings::contains_multiple to pylibcudf (#19003) @davidwendt
- Avoid unnecessary partition step in streaming join (#19002) @rjzamora
- Part 2/n: Use cooperative groups in PQ decoders (#18978) @mhaseeb123
- Move libcudf copying benchmarks to nvbench (#18976) @davidwendt
- Add lag/lead/bitwise/row_number aggregations to pylibcudf (#18975) @mroeschke
- Switch to importing rather than cimporting datetime (#18974) @vyasr
- stop uploading packages to downloads.rapids.ai (#18973) @jameslamb
- Trace `IR.do_evaluate` in cudf_polars (#18970) @TomAugspurger
- xfail more pandas unit tests that fail with cudf.pandas before execution instead of xfailing after execution (#18965) @mroeschke
- Remove test checks that depend on the compression engine (#18960) @vuule
- Use cooperative-groups for warp-parallel kernels in strings functions (#18959) @davidwendt
- fetch code before running pull request labeler (#18958) @jameslamb
- Use cooperative groups in parquet decoder kernels (#18954) @mhaseeb123
- Add a DataType container in cudf_polars (#18953) @mroeschke
- add 'rapids-init-pip' to test_cudf_polars_polars_tests.sh (#18951) @jameslamb
- parameterized ucx / ucxx (#18949) @quasiben
- Rework cudf::sorted_order implementation for faster compile (#18948) @davidwendt
- Remove deprecated Series methods, isclose (#18947) @mroeschke
- Remove deprecated groupby.collect (#18946) @mroeschke
- Remove deprecated get_dummies(cats=, ...) (#18944) @mroeschke
- Add .python_typecode and .typestr attributes to DataType (#18941) @Matt711
- Remove deprecated APIs (#18933) @vuule
- Remove cudf.Scalar (#18927) @mroeschke
- Add #pragma once to prevent redundant includes and speed up compilation (#18925) @PointKernel
- Bump polars version to <1.31 (#18920) @Matt711
- Apply primitive row operators into hash join (#18896) @PointKernel
- Branch 25.08 merge branch 25.06 (#18895) @vyasr
- Remove deprecated `cudf::io::host_buffer` (#18881) @Matt711
- Fix decompression scratch size in AUTO mode (#18878) @vuule
- Apply linter suggestions to cuIO code (#18876) @vuule
- xfail pandas unit tests that fail with cudf.pandas (#18872) @mroeschke
- Branch 25.08 merge branch 25.06 (#18855) @vyasr
- Add support for extended dtypes in `cudf.pandas` (#18832) @galipremsagar
- Auto merge fix for branch-25.08 (#18824) @davidwendt
- Forward-merge branch-25.06 to branch-25.08 (#18817) @Matt711
- Forward-merge branch-25.06 to branch-25.08 (#18756) @Matt711
- Fix auto merge conflict for branch-25.08 (#18733) @davidwendt
- Forward-merge branch-25.06 to branch-25.08 (#18698) @Matt711
- Fix merge conflict for auto-merger 25.06 to 25.08 (#18693) @davidwendt
- Fix merge conflict: branch-25.06 into branch-25.08 (#18668) @davidwendt
- Make cuda12 as JNI default (#18651) @pxLi
- Forward-merge branch-25.06 into branch-25.08 (#18647) @bdice
- Fix merge branch-25.06 into branch-25.08 (#18622) @davidwendt
- Store polars Series instead of pyarrow Array in cudf_polars LiteralColumn expr (#18564) @mroeschke
- Refactor strings split/record with whitespace logic (#18560) @davidwendt
- Refactor hash join with multiset (#18021) @PointKernel
Published by AyodeAwe 7 months ago
https://github.com/rapidsai/cudf - [NIGHTLY] v25.10.00
🐛 Bug Fixes
- Fix logic for number of unique values generated by data profile in benchmarks (#19540) @shrshi
- Fix value counts expression when the column has nulls (#19524) @Matt711
- Prefer `Column.astype` over `plc.unary.cast` in the fill null unary function expression (#19479) @Matt711
- Fix missing return in StringFunction.Strptime strict=True path (#19464) @Matt711
- Make dividing a boolean column return f64 dtype in cudf-polars (#19443) @Matt711
- branch-25.10-merge-branch-25.08 (#19429) @davidwendt
🚀 New Features
- Make nvCOMP ZLIB (de)compression available by default (#19528) @vuule
- Add primitive row dispatch support for semi/anti join and cudf::contains (#19518) @PointKernel
- Derive and use page mask at subpass level for chunked reads (#19515) @mhaseeb123
- Implement top k expression in cudf-polars using `cudf::top_k` (#19431) @Matt711
- [FEA] Add chunked Parquet sink support using the libcudf writer (#19015) @Matt711
🛠️ Improvements
- Move timeout in cudf.pandas pandas unit tests script to ci script (#19542) @mroeschke
- Get rid of CG logic in the mixed semi-join kernel (#19536) @PointKernel
- Construct more cuDF classic Columns with pylibcudf instead of using Buffers (#19535) @mroeschke
- Fix clang-tools version pinning (#19529) @wence-
- Add cudf_polars unit test for `is_in([])` expr (#19525) @mroeschke
- Expose `nvtext::letter_type` to python (#19520) @Matt711
- Add missing import of pyarrow.parquet when reading specified row_groups. (#19509) @bdice
- Don't run serial cudf_pandas tests when testing multiple pandas versions (#19507) @mroeschke
- Add nvtx ranges and minor fix for `lists` types in the next-gen parquet reader (#19493) @mhaseeb123
- Move test_avro/test_api_types.py and some DataFrame tests to new cudf classic test directory structure (#19490) @mroeschke
- Move test_series.py to new cudf classic test directory structure (#19485) @mroeschke
- Move test_testing.py to new cudf classic test directory structure (#19481) @mroeschke
- Allow latest OS in devcontainers (#19480) @bdice
- Branch 25.10 merge branch 25.08 (#19475) @davidwendt
- Improve readability when printing pylibcudf enums (#19451) @Matt711
- Use more pytest fixtures and avoid GPU parameterization in cuDF classic tests (#19450) @mroeschke
- Update build infra to support new branching strategy (#19445) @robertmaynard
- Use more pytest fixtures and avoid GPU parameterization in test_indexing/joining/monotonic/multiindex.py (#19437) @mroeschke
- Use more pytest fixtures and avoid GPU parameterization in cuDF classic tests (#19436) @mroeschke
- Update s3 Bucket fixture creation in test_s3 (#19424) @mroeschke
- Use more pytest fixtures and avoid GPU parameterization in cuDF classic tests (#19419) @mroeschke
- Use GCC 14 in conda builds. (#19192) @vyasr
Published by rapids-bot[bot] 7 months ago
https://github.com/rapidsai/cudf - v25.06.00
🚨 Breaking Changes
- Remove cudf.BaseIndex (#18751) @mroeschke
- Implement `BIT_COUNT` unary operation (#18589) @ttnghia
- Expose column chunk metadata in `read_parquet_metadata()` (#18579) @mhaseeb123
- Fix overflow for `MERGE_M2` groupby aggregation (#18546) @ttnghia
- Deduplicate parquet physical type enums (#18526) @mhaseeb123
- Implemented String Output & User-data Support for Transforms (#18490) @lamarrr
- Promote Parquet type enums to enum classes (#18441) @mhaseeb123
- Move parquet schema types and structs to public headers (#18424) @mhaseeb123
- Start removal of vector factories with `_sync` suffix by deprecating them and adding versions without the suffix (#18414) @vuule
- Skip decoding of pages marked as pruned in PQ reader (#18347) @mhaseeb123
- Deprecate nvtext subword tokenizer (#18334) @davidwendt
- Add standard data ingestion pipelines to pylibcudf for ndarrays (#18311) @Matt711
- Remove extraneous modules from top level cudf namespace (#18287) @mroeschke
- Add Keep Option Parameter to Distinct (#18237) @warrickhe
- Update to CCCL 2.8.x with no CCCL patches (#18235) @bdice
🐛 Bug Fixes
- Disable pytest benchmark for Narwhals CI job (#19074) @Matt711
- Avoid undefined behaviour in rolling_store_output_functor (#19069) @wence-
- Filter out pkg_resources UserWarning to make nightly CI pass (#19058) @Matt711
- Pin deltalake to <1.0.0 (#19017) @Matt711
- [BUG] Incorrectly getting the caller's frame when searching for locals and globals in cudf.pandas (#18979) @Matt711
- Ensure gc fixture is used in custreamz test (#18915) @TomAugspurger
- Fix a potential segfault in PQ reader's number of rows per source calculation (#18906) @mhaseeb123
- Fix Dataframe `getitem` when `MultiIndex` columns exist (#18880) @galipremsagar
- Ensure eq/ne between Columns in public objects don't return bool (#18875) @mroeschke
- Fix fencepost error in `Repartition` task generation (#18854) @wence-
- Fix cudf_polars pl.col(...).len() always excluding null values (#18849) @mroeschke
- Throw a descriptive exception in Parquet reader when trying to read files with more than two billion rows (#18835) @mhaseeb123
- Skip a decompression test (#18825) @vuule
- Update strings benchmarks to use alloc_size column/table function (#18822) @davidwendt
- Fix host decompression of empty DEFLATE data (#18805) @vuule
- Avoid going OOM in `test_row_limit_exceed_raises` by using dummy array (#18802) @Matt711
- Fix host decompression of empty Snappy data (#18800) @vuule
- Skip test that fails due to polars issue (#18787) @wence-
- Ensure scalar dtype is always set in from_py (#18780) @vyasr
- Fix reading of Snappy compressed Avro files (#18774) @vuule
- Fix missing semicolon in label_bins.cu (#18765) @evanramos-nvidia
- Fix noexcept annotations on strings_column_view (#18763) @wence-
- Fix integer overflows in pylibcudf `from_column_view_of_arbitrary` (#18758) @wence-
- Fix overflow case and clean up some logic (#18734) @vyasr
- Link to `nvtx3::nvtx3-cpp` instead of `nvToolsExt` (#18730) @jakirkham
- Revise `DaskIntegration` protocol to align with `rapidsmpf` (#18720) @rjzamora
- Fix `skip_compression` option in the Parquet writer with host compression (#18714) @vuule
- Add missing header (#18671) @vyasr
- Revert "Set flag to always use unsafe atomic storage" (#18657) @PointKernel
- Fix optional operator* called on a disengaged value in clamp.cu (#18655) @davidwendt
- Add missing header to host_memory.cpp (#18649) @alliepiper
- Fix device compression when writing Parquet files without using nvCOMP (#18644) @vuule
- Add CUDA_ARCHITECTURES setting to cpp-linters script (#18637) @davidwendt
- Pin to cython<3.1 (#18617) @wence-
- Fix `DataFrame.memory_usage` output order (#18595) @mroeschke
- Set flag to always use unsafe atomic storage (#18590) @PointKernel
- Update KvikIO S3 endpoint usage (#18565) @kingcrimsontianyu
- Skip cuml third-party integration tests that may segfault (#18561) @Matt711
- Allow .iloc with cuDF objects as column indexers (#18558) @mroeschke
- Fix overflow for `MERGE_M2` groupby aggregation (#18546) @ttnghia
- Add back cudf root (#18544) @vyasr
- Change default memory resource for 'distributed' cudf-polars (#18531) @rjzamora
- Fix copy-on-write buffer separation and cleanup (#18530) @galipremsagar
- Fix cpp examples cmake to use the rapids_config.cmake (#18501) @davidwendt
- Rename rapidsmp to rapidsmpf (#18493) @rjzamora
- Fix compilation with the C++20 standard (#18486) @vuule
- Fix an error when reading some compressed Parquet V2 files (#18478) @vuule
- Support title-case characters in strings capitalize() and title() APIs (#18457) @davidwendt
- Ensure DataFrame column label operations reset label_dtype (#18452) @mroeschke
- Fix a segfault when reading a Parquet file with unsupported compression type (#18451) @vuule
- Fix logger macros (#18444) @vyasr
- Fix auto-detection of compression type in host-side decompression (#18440) @shrshi
- Use delete not free to release data allocated with new (#18412) @wence-
- Fix synchronization issues in host compression and decompression (#18395) @vuule
- Update Dask array-conversion handling (#18382) @rjzamora
- Fixed indexing on empty DataFrame with no columns (#18381) @TomAugspurger
- Deterministic hashing for DataFrameScan nodes in cudf-polars multi-partition executor (#18351) @TomAugspurger
- Fix index of right table in unary operators in AST, in Joins (#18333) @karthikeyann
- Add offsetalator to contiguous-split (#18312) @davidwendt
- Support large strings in nvtext vocabulary-tokenizer (#18283) @davidwendt
- Handle empty aggregations in multi-partition cudf.polars group_by (#18277) @TomAugspurger
📖 Documentation
- Docs for streaming executor options (#18934) @quasiben
- Fix some duplicate toctree issues and improve groupby docs (#18580) @vyasr
- [DOC] Running libcudf benchmarks and comparing output results (#18548) @Matt711
- Fix doxygen usage of the contraction for it is (#18517) @davidwendt
- Clarify @brief tag as description/title on documentation guide (#18515) @davidwendt
- [DOC] Improve clarity in parquet APIs set_row_groups and set_columns parquet (#18466) @Matt711
- Add a usage page to cudf-polars documentation (#18460) @Matt711
- [DOC] Fix typo in CONTRIBUTING.md on build type tests (#18456) @JigaoLuo
- improve docs related to documentation contribution (#18418) @ncclementi
- Add restart kernel note in cudf pandas docs (#18374) @ncclementi
🚀 New Features
- Add CLI argument to enable RMM async memory resource in PDS-H (#18899) @pentschev
- Scan a headerless CSV file with column names provided (#18816) @Matt711
- Add fast paths for `DataFrame.to_cupy` (#18801) @Matt711
- Require `numba-cuda>=0.11.0` (#18770) @brandon-b-miller
- Create a pylibcudf Column from a python iterable (#18768) @Matt711
- Support `ConditionalJoin` via broadcasting in cudf-polars streaming engine (#18723) @rjzamora
- Experimental PQ reader utility to calculate total rows in input row groups (#18716) @mhaseeb123
- Extend `explain_query` to support printing the logical plan (pre lowered plan) (#18708) @Matt711
- Reuse `libcudf` dependencies for Java JNI build when they are available (#18682) @ttnghia
- Add alloc_size member function to cudf::column and cudf::table (#18639) @davidwendt
- Print the physical cudf-polars plan in `pdsh.py` (#18635) @rjzamora
- String Transform Examples (#18616) @lamarrr
- Add streaming support for `group_by -> n_unique` to cudf-polars (#18606) @rjzamora
- Export cudf compiler flags and definitions (#18604) @ttnghia
- Implement `BIT_COUNT` unary operation (#18589) @ttnghia
- Expose column chunk metadata in `read_parquet_metadata()` (#18579) @mhaseeb123
- Add APIs to check ORC and Parquet compression support at runtime (#18578) @vuule
- Add `Distinct` support to the cudf-polars streaming executor (#18576) @rjzamora
- Add support for large list host Arrow data conversion (#18562) @vyasr
- Implement `BITWISE_AGG` aggregations (bitwise `AND`, `OR` and `XOR`) for sort-based groupby and reduction (#18551) @ttnghia
- Implement row group pruning with bloom filters in experimental PQ reader (#18545) @mhaseeb123
- Implement row group pruning with stats in experimental PQ reader (#18543) @mhaseeb123
- [JNI] Expose row-wise sha1 api (#18540) @warrickhe
- Add `Sort+head/tail` support to streaming cudf-polars executor (#18538) @rjzamora
- Add multi-partition MapFunction support to cudf-polars (#18523) @rjzamora
- Adds support for writing raw UTF-8 characters (without escaping) in the JSON writer (#18508) @Matt711
- Support reading from device buffers in the pylibcudf IO APIs (#18496) @Matt711
- Support multi-partition `Select` operations with aggregations (#18492) @rjzamora
- Implemented String Output & User-data Support for Transforms (#18490) @lamarrr
- Add a utility to bulk set multiple null masks (#18489) @mhaseeb123
- High level interface for experimental PQ reader and implementation of metadata APIs (#18480) @mhaseeb123
- Added `pylibcudf.utilities.is_ptds_enabled` (#18467) @TomAugspurger
- Add a public API for copying a table_view to device array (#18450) @Matt711
- Support `cudf-polars` `cast_time_unit` (#18442) @brandon-b-miller
- Support creating a pylibcudf Column from a host array (#18425) @Matt711
- Move parquet schema types and structs to public headers (#18424) @mhaseeb123
- Add optional dtype argument to `Scalar.from_any` (#18415) @Matt711
- Expose `cudf::chunked_pack` in pylibcudf (#18411) @wence-
- Add support for long string columns in cudf::contiguous_split (#18393) @nvdbaranec
- Implemented String Input support for Transforms and Removed `jit::column_device_view` (#18378) @lamarrr
- Automatically dispatch between host and device decompression/compression based on the number of buffers (#18363) @vuule
- Expose join hash table load factor (#18361) @PointKernel
- Skip decoding of pages marked as pruned in PQ reader (#18347) @mhaseeb123
- Sort-based inner join for high-multiplicity tables (#18318) @shrshi
- Support constructing pylibcudf Columns and Tables from views into arbitrary objects (#18314) @vyasr
- Add standard data ingestion pipelines to pylibcudf for ndarrays (#18311) @Matt711
- Support `cudf-polars` `isoyear` and `week` (isoweek) (#18265) @brandon-b-miller
- Add Keep Option Parameter to Distinct (#18237) @warrickhe
- Add rapidsmp shuffle support to cudf-polars (#18231) @rjzamora
- Support `cudf-polars` `strftime` (#18181) @brandon-b-miller
- Add benchmark for join operations with low build table cardinality (#18105) @shrshi
- Add nvtext substring deduplication APIs (Part 2) (#18104) @davidwendt
- Support `include_file_paths` in cudf polars (#18057) @Matt711
- Add support for the Arrow device capsule interfaces (#15370) @vyasr
🛠️ Improvements
- use 'rapids-init-pip' in wheel CI, other CI changes (#18902) @jameslamb
- Avoid RecursionError in custreamz test (#18887) @TomAugspurger
- Update NumPy dependency in cudf.pandas-catboost integration test (#18870) @Matt711
- CPU only execution for PDSH (#18869) @quasiben
- Remove more top level cudf imports in core (#18862) @mroeschke
- Remove top level cudf imports in core (#18857) @mroeschke
- Add CUDF_INSTALL_DIR for JAVA build script (#18852) @pxLi
- Call the correct `from_pandas` in `hdf` reader (#18850) @galipremsagar
- Update `__all__` in `cudf_polars/dsl/ir.py` (#18848) @Matt711
- Upload examples conda package (#18847) @vyasr
- Add retries to prevent failures in occasionally slow CI runs (#18843) @galipremsagar
- Finish CUDA 12.9 migration and use branch-25.06 workflows (#18839) @bdice
- Remove toplevel `import cudf` from window/tools/join directories (#18833) @mroeschke
- Remove toplevel `import cudf` from cudf/io files (#18829) @mroeschke
- Update pdsh benchmark script to support explain-only (#18826) @TomAugspurger
- Refactor UDF utils and add a hook to enable NRT when necessary (#18823) @brandon-b-miller
- Fix memory access error in nvtext::edit_distance (#18821) @davidwendt
- Update to clang 20 (#18818) @bdice
- Reduce more data sizes of Python tests (#18814) @mroeschke
- Mark DataFrame.dtypes as an _external_only_api (#18809) @mroeschke
- Change calls to thrust::swap to cuda::std::swap (#18808) @davidwendt
- Move implemented BaseIndex methods over to Index (#18807) @mroeschke
- Improve pandas version fetching script (#18793) @galipremsagar
- Change cudf::sort googlebench benchmarks to nvbench (#18786) @davidwendt
- Only warn in cudf.pandas if rmm mode explicitly set and rmm already configured (#18785) @jcrist
- Quote head_rev in conda recipes (#18784) @bdice
- Move RangeIndex implementation below Index (#18777) @mroeschke
- Remove unnecessary _Ravelled class (#18771) @Matt711
- Remove pytest-rerunfailures (#18766) @mroeschke
- Replace from_arrow with direct calls Column/Table constructors in pylibcudf and cudf-polars tests (#18762) @Matt711
- CUDA 12.9 use updated compression flags (#18755) @robertmaynard
- fix(rattler): add `librmm` to host for `libcudf` to fix overlinking error (#18754) @gforsyth
- Remove the file name from the output in cudf-polars' explain APIs (#18752) @Matt711
- Remove cudf.BaseIndex (#18751) @mroeschke
- Support creating a pylibcudf Column from a general ndarray (#18744) @Matt711
- Improve lowering of `Distinct` IR nodes for high-cardinality data (#18725) @rjzamora
- Simplify Numba-CUDA MVC logic (#18724) @bdice
- Test with CUDA 12.9.0 (#18721) @bdice
- Add more `cudf.Series` microbenchmarks (#18718) @Matt711
- Run unit-tests-cudf-pandas on branch-25.06 for nightly tests (#18717) @davidwendt
- Move `test_large_unique_categories_repr` to benchmarks (#18715) @galipremsagar
- Allow `pylibcudf.Column` to consume objects exposing `__arrow_c_stream__` (#18712) @mroeschke
- Switch from printing to logging (#18711) @vyasr
- Add Python tests for different compression implementations (#18710) @vuule
- Remove redundant xfails in cuml integration tests (#18699) @Matt711
- ci: run unit-tests-cudf-pandas on `branch-25.06` workflow (#18692) @gforsyth
- Exclude librmm.so from auditwheel (#18691) @bdice
- Add C++ tests for different compression implementations (#18690) @vuule
- Improve runtime of cuDF Python unit tests (#18689) @mroeschke
- Require at least numba-cuda `0.10.1` (#18688) @brandon-b-miller
- Add `nvidia-cuda-{nvrtc, nvcc}` as a dependency for cuDF wheels (#18686) @brandon-b-miller
- Support rolling aggregations in in-memory cudf-polars execution (#18681) @wence-
- Replace `parquet_blocksize` with `target_partition_size` (#18669) @rjzamora
- Skip test_large_unique_categories_repr in CI (#18666) @bdice
- Locally import pyarrow.dataset and fsspec for `import cudf` performance (#18663) @mroeschke
- Disable `arm64` python tests (#18662) @galipremsagar
- Pin numba-cuda>=0.9.0,!=0.10.0 due to CI hangs on ARM (#18661) @mroeschke
- Fix compile warnings in Java JNI (#18660) @ttnghia
- Drop `Empty` nodes from IR graph (#18658) @rjzamora
- Add support for Python 3.13 (#18648) @gforsyth
- Cleanup libcudf detail/aggregation.hpp/.cuh (#18642) @davidwendt
- Skip all known pytest failures in pandas-tests (#18641) @galipremsagar
- Preserve partitioning after `Filter` and `Projection` in cudf-polars (#18638) @rjzamora
- Support quantile in cudf-polars grouped aggregations (#18634) @wence-
- Deprecate Series.nullmask, Series.nullable, Series.from_categorical, Series.from_masked_array, cudf.isclose (#18631) @mroeschke
- Access private objects by importing from module instead of `cudf.core/util` namespace (#18629) @mroeschke
- Replace unnecessary cudf::size_of() calls with sizeof() (#18628) @davidwendt
- Improve cold cache dropping (#18626) @kingcrimsontianyu
- Improve default config values for cudf-polars streaming (#18623) @rjzamora
- Add gtest error check for nvtext::wordpiece_tokenize (#18621) @davidwendt
- Polars dataframe serialize using chunked pack (#18614) @madsbk
- xfail all known errors in pandas-test suite (#18612) @galipremsagar
- Add `TemporalBaseColumn` as a parent class to `DatetimeColumn` and `TimedeltaColumn` (#18611) @mroeschke
- Update cudf::cast internal function to use sizeof instead of cudf::size_of (#18607) @davidwendt
- Move cudf/utils/utils.py methods to appropriate locations (#18605) @mroeschke
- pylibcudf.Column: add `device_buffer_size` and register a dask.sizeof function for cudf-polars Column and DataFrame (#18602) @madsbk
- Use `cached_property` for Datetime and Timedelta column properties (#18601) @mroeschke
- Annotate and simplify `from_arrow` (#18600) @mroeschke
- Enable reporting peak memory usage for gtests (#18599) @davidwendt
- Prune methods from Frame that are specific to subclasses (#18597) @mroeschke
- Switch `tensorflow` integration tests to use 12.x (#18596) @galipremsagar
- refactor: use `libnvcomp` from `libkvikio` wheel to unblock Python 3.13 upgrade (#18593) @gforsyth
- Add temporary pdsh benchmarks to `cudf_polars.experimental` (#18592) @rjzamora
- Update `numba-cuda` dependency to `>=0.9.0` (#18591) @brandon-b-miller
- use 'certifi' certificates in fetch_pandas_versions script (#18588) @jameslamb
- Add nvtext substring duplication APIs (Part 1) (#18585) @davidwendt
- Bump polars version to <1.29 (#18581) @Matt711
- Allow datetime.timedelta objects in pylibcudf.Scalar.from_py (#18577) @mroeschke
- Rework strings split_helper utility for better reuse (#18575) @davidwendt
- Additional tests strings for strings split APIs (#18574) @davidwendt
- Support datetime.datetime objects in pylibcudf.Scalar.from_py (#18572) @mroeschke
- Store Python scalars instead of PyArrow Scalars in cudf_polars Literal expr (#18563) @mroeschke
- Support `plc.Scalar.from_py(None)` and `plc.Scalar.from_py(int, float type)` (#18559) @mroeschke
- Add xfail window function tests for cudf_polars (#18557) @btepera
- Add fast paths to `Series.to_cupy` and `Series.values` (#18555) @Matt711
- Reduce cudf-polars pyarrow usage (#18554) @vyasr
- Avoid possible invalid kernel grid error in `cudf::set_null_masks` if no bitmasks to set (#18553) @mhaseeb123
- Adjust cudf Python groupby test for cuCollections update (#18550) @mroeschke
- Refactor scan test I/O logic into shared `make_partitioned_source` helper (#18542) @Matt711
- Download build artifacts from Github for CI jobs (#18539) @VenkateshJaya
- Update hypothesis version (#18537) @galipremsagar
- Make Python testing dependencies more specific to pylibcudf vs cudf (#18535) @mroeschke
- Pin hypothesis<6.131.1 due to performance issues (#18532) @mroeschke
- Deduplicate parquet physical type enums (#18526) @mhaseeb123
- Reduce the number of miscellaneous pandas unit tests run with cudf.pandas (#18524) @mroeschke
- Improve nvtext::tokenize_with_vocabulary performance (#18522) @davidwendt
- Make pylibcudf.Column.from_rmm_buffer a Python staticmethod (#18521) @mroeschke
- Add more short circuit checks for .equals (#18520) @mroeschke
- Add synchronous task scheduler to cudf-polars (#18519) @rjzamora
- Don't fetch dlpack headers when building cuDF Python (#18518) @mroeschke
- Refactor polars configuration (#18516) @TomAugspurger
- Refactor internal strings utility to separate header and definition file (#18514) @davidwendt
- Fix `print()` keyword argument in cudf pandas test (#18513) @trxcllnt
- Improve performance of strings split-record on whitespace (#18510) @davidwendt
- Use `cuda::std::iter_value_t` instead of thrust iterator traits (#18509) @miscco
- Remove redundant task-graph logic for streaming `GroupBy` (#18507) @rjzamora
- Replace `GPU_ARCHS` build variable by `CMAKE_CUDA_ARCHITECTURES` (#18506) @ttnghia
- Optimize pandas metadata generation to reduce memory pressure (#18505) @galipremsagar
- Replace deprecated host_buffer in favor of host_span in SourceInfo (#18503) @Matt711
- Add pylibcudf.Column.from_rmm_buffer (#18502) @mroeschke
- Replace thrust functors with libcu++ ones (#18500) @miscco
- Rename cudf-polars executors (#18499) @rjzamora
- Remove casting functions in pylibcudf utils (#18497) @Matt711
- Increase wheel size limit. (#18487) @bdice
- Add CategoricalIndex.from_codes (#18485) @mroeschke
- Split join header (#18484) @shrshi
- Fix unspecified behavior involving move semantics and order of evaluation (#18481) @kingcrimsontianyu
- Remove need for to_cudf_compatible_scalar (#18477) @mroeschke
- Rerun flaky pytests in CI (#18476) @galipremsagar
- Vendor RAPIDS.cmake (#18473) @bdice
- Add ARM conda environments. (#18470) @bdice
- Bump polars version to <1.28 (#18469) @Matt711
- Add sink support in cudf_polars (#18468) @mroeschke
- Enable rapidsmpf spilling in cudf-polars (#18461) @madsbk
- Promote Parquet type enums to enum classes (#18441) @mhaseeb123
- Consolidate logic in DataFrame.__init__ for listlike arguments (#18439) @mroeschke
- Update compression formats supported in JSON reader (#18438) @shrshi
- Disabled Jitify Minification (#18436) @lamarrr
- Fix printing decimal128 types that are zero (#18435) @trxcllnt
- Replace direct use of nvCOMP and of its adapter with the higher-level decompression API (#18434) @vuule
- Add more `cudf.DataFrame` constructor pytest benchmarks (#18433) @mroeschke
- Test against stable tags for narwhals (#18431) @Matt711
- Refcount-based dropping of cached evaluations in cudf-polars executor (#18430) @wence-
- Replace `Thrust` iterator facilities with libcu++ ones (#18427) @miscco
- Remove numpy requirement when converting 2d cuda array interface objects to pylibcudf Columns (#18426) @Matt711
- Share more cudf.Column methods for `indices_of/isin` (#18423) @mroeschke
- Switch the ptr type in gpumemoryview from Py_ssize_t to uintptr_t (#18419) @Matt711
- Add strings::extract_single API (#18417) @davidwendt
- Add to_arrow_host_stringview interop API (#18416) @davidwendt
- Start removal of vector factories with `_sync` suffix by deprecating them and adding versions without the suffix (#18414) @vuule
- Allow polars arrow conversion to produce string_view (#18413) @wence-
- Change `dask_cudf.to_parquet` behavior for local filesystems (#18408) @rjzamora
- Add rank and label_bin methods to ColumnBase (#18407) @mroeschke
- Improve performance of strings::like for long strings (#18406) @davidwendt
- Automatic single-partition fallback in cudf-polars (#18405) @rjzamora
- Remove `_sync` suffix from hostdevice types (#18404) @vuule
- Use owning Arrow types in C++ to expose data to Python (#18402) @vyasr
- add static push and pop methods to NvtxRange (#18401) @zpuller
- Deprecate cudf.Scalar (#18394) @mroeschke
- Bump polars version to <1.27 (#18387) @Matt711
- Branch 25.06 merge 25.04 (#18380) @Matt711
- Silence warning by setting BUILD_SHARED_LIBS (#18371) @vyasr
- Rewrite groupby aggregations in cudf-polars to simplify evaluation (#18369) @wence-
- Pass stream through when taking ownership from libcudf (#18367) @wence-
- Expose new grouped_range_rolling API in pylibcudf (#18365) @wence-
- Avoid patching sort algorithms from CCCL (#18364) @miscco
- Deprecate old nvtext::normalize_characters (#18360) @davidwendt
- refactor(rattler): enable strict channel priority for builds (#18358) @gforsyth
- Optimize `sequences` by introducing `make_offsets_child_column` (#18357) @ustcfy
- Decompress all data in a single `decompress_page_data` when reading Parquet input in a single chunk (#18352) @vuule
- Moving wheel builds to specified location and uploading build artifacts to Github (#18346) @VenkateshJaya
- Performance improvement for tolower/toupper for multi-byte UTF-8 characters (#18345) @davidwendt
- Branch 25.06 merge branch 25.04 (#18344) @vyasr
- Use dask-cuda for cudf-polars experimental testing (#18343) @rjzamora
- Deprecate nvtext subword tokenizer (#18334) @davidwendt
- Remove cudf.Scalar in as_column (#18331) @mroeschke
- Add tests for `cudf.polars` to be able to work on a cpu-only machine (#18327) @galipremsagar
- Allow `cudf.DataFrame.from_pylibcudf` to accept a `pylibcudf.io.TableWithMetadata` (#18319) @mroeschke
- Avoid stateful construction in `DataFrame.__init__` (#18306) @mroeschke
- Improve the groupby performance for extremely low cardinality (#18290) @PointKernel
- Remove extraneous modules from top level cudf namespace (#18287) @mroeschke
- Require type annotations in cudf.polars (#18285) @TomAugspurger
- Removing unnecessary StreamSynchronization in reading (#18279) @JigaoLuo
- Update to CCCL 2.8.x with no CCCL patches (#18235) @bdice
- Reduce register pressure for compute_column_kernel (#18226) @matal-nvidia
- Use the mapped buffer for all read operations in the memory-mapped source; switch default source to the kvikIO one (#18204) @vuule
- Improve test coverage in the catboost integration tests (#18126) @Matt711
- Create file sources in parallel (#18094) @vuule
- Enable `stumpy_distributed` tests (#17969) @galipremsagar
- Refactor distinct join to use primitive row operators when proper (#17726) @PointKernel
- Update chunked parquet reader benchmarks (#16543) @sdrp713
- C++
Published by raydouglass 9 months ago
https://github.com/rapidsai/cudf - [NIGHTLY] v25.08.00
🔗 Links
🚨 Breaking Changes
- Remove deprecated Series methods, isclose (#18947) @mroeschke
- Remove deprecated groupby.collect (#18946) @mroeschke
- Remove deprecated get_dummies(cats=, ...) (#18944) @mroeschke
- Remove cudf.Scalar (#18927) @mroeschke
- Remove deprecated `cudf::io::host_buffer` (#18881) @Matt711
🐛 Bug Fixes
- Fix flaky custreamz test (#18961) @TomAugspurger
📖 Documentation
- Update cuDF Python library design with BaseIndex and pylibcudf updates (#18903) @mroeschke
🚀 New Features
- Add CLI argument to enable OOM protection in PDS-H (#18914) @pentschev
🛠️ Improvements
- add 'rapids-init-pip' to test_cudf_polars_polars_tests.sh (#18951) @jameslamb
- parameterized ucx / ucxx (#18949) @quasiben
- Remove deprecated Series methods, isclose (#18947) @mroeschke
- Remove deprecated groupby.collect (#18946) @mroeschke
- Remove deprecated get_dummies(cats=, ...) (#18944) @mroeschke
- Add .python_typecode and .typestr attributes to DataType (#18941) @Matt711
- Remove cudf.Scalar (#18927) @mroeschke
- Add #pragma once to prevent redundant includes and speed up compilation (#18925) @PointKernel
- Branch 25.08 merge branch 25.06 (#18895) @vyasr
- Remove deprecated `cudf::io::host_buffer` (#18881) @Matt711
- Apply linter suggestions to cuIO code (#18876) @vuule
- xfail pandas unit tests that fail with cudf.pandas (#18872) @mroeschke
- Branch 25.08 merge branch 25.06 (#18855) @vyasr
- Auto merge fix for branch-25.08 (#18824) @davidwendt
- Forward-merge branch-25.06 to branch-25.08 (#18817) @Matt711
- Forward-merge branch-25.06 to branch-25.08 (#18756) @Matt711
- Fix auto merge conflict for branch-25.08 (#18733) @davidwendt
- Forward-merge branch-25.06 to branch-25.08 (#18698) @Matt711
- Fix merge conflict for auto-merger 25.06 to 25.08 (#18693) @davidwendt
- Fix merge conflict: branch-25.06 into branch-25.08 (#18668) @davidwendt
- Make cuda12 as JNI default (#18651) @pxLi
- Forward-merge branch-25.06 into branch-25.08 (#18647) @bdice
- Fix merge branch-25.06 into branch-25.08 (#18622) @davidwendt
- C++
Published by rapids-bot[bot] 9 months ago
https://github.com/rapidsai/cudf - v25.04.00
🚨 Breaking Changes
- Remove unused `group_range_rolling_window` API (#18313) @wence-
- [BUG] Disabled JIT for CUDA Runtime < 11.5 (#18296) @lamarrr
- Remove cudf.Scalar from binops (#18240) @mroeschke
- Enforce deprecation of dtype parameter in sum/product (#18070) @mroeschke
- Remove deprecated single component datetime extract APIs (#18010) @Matt711
- Remove deprecated rolling window functionality (#17993) @wence-
- Remove deprecated nvtext::minhash_permuted APIs (#17939) @davidwendt
- Remove dataframe protocol (#17909) @vyasr
- Use new rapids-logger library (#17899) @vyasr
- Added Multi-input & Scalar Support for Transform UDFs (#17881) @lamarrr
- Fixed incorrect PTX parsing of `ret` instruction after branch label (#17859) @lamarrr
- Use KvikIO to enable file's fast host read and host write (#17764) @kingcrimsontianyu
🐛 Bug Fixes
- Fix alpha versions of cudf package. (#18429) @bdice
- Backport: Deterministic hashing for DataFrameScan nodes in cudf-polars multi-partition executor (#18351) (#18420) @bdice
- Skip failing Narwhals rolling groupby tests (#18398) @Matt711
- Pin cmake in test_java to be less than 4.0.0 (#18392) @abellina
- Skip polars tests that fail with pydantic deprecation warnings (#18388) @Matt711
- Backport: Fix index of right table in unary operators in AST, in Joins (#18342) @bdice
- xfail narwhals sqlframe tests (#18297) @Matt711
- [BUG] Disabled JIT for CUDA Runtime < 11.5 (#18296) @lamarrr
- Make a pylibcudf Column from a device array object with `strides=None` (#18295) @Matt711
- Fix `cudf.pandas` objects to not be `Callable` (#18288) @galipremsagar
- Skip failing polars test test_general_prefiltering (#18264) @Matt711
- Filter all cudf.pandas profiler tests from running in parallel (#18262) @Matt711
- Allow cudf.Series([pd.NA], dtype=, nan_as_null=False) (#18259) @mroeschke
- Fix `cross` join with extra columns (#18256) @galipremsagar
- Fix `Dataframe.loc` to not modify the actual dataframe (#18254) @galipremsagar
- Remove RMM macro usage from to_arrow_device.cu (#18252) @davidwendt
- Skip Narwhals cross join tests for cudf.pandas CI run (#18249) @Matt711
- Fix cudf-polars tests for polars < 1.24 (#18246) @wence-
- Fix experimental cudf-polars tests (#18244) @rjzamora
- Fix `datetime64` vs `datetime` binops max resolution (#18241) @galipremsagar
- Use CCCL::libcudacxx include directories in Jitify preprocessing. (#18233) @bdice
- Disable conda prefix patching to avoid mangling binaries (#18225) @vyasr
- Workaround for ARM compiler issue with single space literal string (#18220) @davidwendt
- Bump nightly check limit (#18213) @Matt711
- Support comparative binops between categorical and non categorical (#18200) @mroeschke
- Make the version file inside cudf.pandas not a symlink (#18198) @vyasr
- Ensure RAPIDS_ARTIFACTS_DIR is set for build metrics reports. (#18192) @bdice
- Ignore run exports of libcufile. (#18190) @bdice
- Skip flaky multi GPU test (#18187) @Matt711
- Fix BPE merges table static-map capacity size (#18184) @davidwendt
- Drop `CUB_QUOTIENT_CEILING` (#18179) @miscco
- Disable ARM CI in C++ and Python test CI jobs (#18175) @Matt711
- Add fmt to the test/benchmarks env (#18173) @vyasr
- Fix merge(how=left, left_on=, right_index=True, sort=True) (#18166) @mroeschke
- Allow nonnative cupy dtype in cudf.Series (#18164) @mroeschke
- Fix Series construction from numpy array with non-native byte order (#18151) @mroeschke
- Use protocol for dlpack instead of deprecated function in cupy notebook (#18147) @Matt711
- Skip failing test (#18146) @vyasr
- Update calls to KvikIO's config setter (#18144) @kingcrimsontianyu
- Reduce memory use when writing tables with very short columns to ORC (#18136) @vuule
- Handle empty dictionary in to_arrow_device interop (#18121) @davidwendt
- Allow pivot_table to accept single label index and column arguments (#18115) @mroeschke
- Preserve DataFrame.column subclass and type during binop (#18113) @mroeschke
- Fix rmm macro call (#18108) @pmattione-nvidia
- Add include for `<functional>` (#18102) @miscco
- Remove static column vectors from window function tests. (#18099) @mythrocks
- Fix scatter_by_map with spilling enabled (#18095) @mroeschke
- Use the right version macro `CCCL_MAJOR_VERSION` (#18073) @miscco
- Fix `test_scan_csv_multi` cudf-polars test (#18064) @rjzamora
- Fix memcopy direction for concatenate (#18058) @tgujar
- Fix upstream dask `loc` test (#18045) @rjzamora
- Fix hang on invalid UTF-8 data in string_view iterator (#18039) @davidwendt
- Fix `dask_cudf.to_orc` deprecation (#18038) @rjzamora
- Compatibility with dask.dataframe's `is_scalar` (#18030) @TomAugspurger
- Fix the build error due to KvikIO update (#18025) @kingcrimsontianyu
- Fix failing ibis test (#18022) @Matt711
- Skip failing polars tests (#18015) @Matt711
- Fix `to_arrow` to return consistent pandas-metadata (#18009) @galipremsagar
- Prevent setting custom attributes to `ColumnMethods` (#18005) @galipremsagar
- Compatibility with Dask `main` (#17992) @TomAugspurger
- [Bug] Fix Parquet-metadata sampling in cudf-polars (#17991) @rjzamora
- Add missing include for calling std::iota() (#17983) @davidwendt
- Fix pickle and unpickling for all objects (#17980) @galipremsagar
- Install duckdb, the default backend for ibis, in the cudf.pandas integration tests (#17972) @Matt711
- Check null count too in sum aggregation (#17964) @Matt711
- Raise NotImplementedError for groupby.agg if duplicate columns would be created (#17956) @mroeschke
- Ensure disabling the module accelerator is thread-safe (#17955) @vyasr
- Fix DataFrame/Series.rank for int and null data in mode.pandas_compatible (#17954) @mroeschke
- Limit buffer size in reallocation policy in JSON reader (#17940) @shrshi
- Make `cudf.pandas` proxy array picklable (#17929) @Matt711
- Add missing standard includes (#17928) @miscco
- Fix torch integration test (#17923) @Matt711
- Fix `to_pandas` writable bug for `datetime` and `timedelta` types (#17913) @galipremsagar
- Raise NotImplementedError if `.merge(suffixes=)` introduces duplicate labels (#17905) @mroeschke
- Fix groupby scans with int and NA data in mode.pandas_compatible (#17895) @mroeschke
- Patch `__init__` of `cudf` constructors to parse through `cudf.pandas` proxy objects (#17878) @galipremsagar
- Fixed incorrect PTX parsing of `ret` instruction after branch label (#17859) @lamarrr
- Relax inconsistent schema handling in `dask_cudf.read_parquet` (#17554) @rjzamora
📖 Documentation
- Clarify that cudf.pandas should be enabled before importing pandas. (#18339) @bdice
- [DOC] Add wordpiece tokenizer to cudf documentation (#18247) @davidwendt
- Added pylibcudf.contiguous_split to API docs (#18194) @TomAugspurger
- Fix build.sh docs for default behavior (#18180) @bdice
- Update Dask-cuDF documentation to fix all warnings and errors (#18157) @TomAugspurger
- [DOC] Document character normalizer (#18125) @Matt711
🚀 New Features
- Add and revise experimental cudf-polars config options (#18284) @rjzamora
- Support `top-k` and `bottom_k` expressions (#18222) @Matt711
- Support `cudf-polars` `is_leap_year` (#18212) @brandon-b-miller
- Support `cudf-polars` `month_start`/`month_end` (#18211) @brandon-b-miller
- Support `cudf-polars` `ordinal_day` (#18152) @brandon-b-miller
- Add `pylibcudf.gpumemoryview` support for `len()`/`nbytes` (#18133) @pentschev
- Link to libzstd for ZSTD compression and decompression APIs (#18129) @shrshi
- Added NDSH Q09 Benchmark for Transforms (#18127) @lamarrr
- Make pylibcudf traits raise exceptions gracefully rather than terminating in C++ (#18117) @Matt711
- Host decompression (#18114) @vuule
- Add owning types to hold Arrow data (#18084) @vyasr
- Bump polars version to <1.24 (#18076) @Matt711
- Support sorted merges in cudf.polars (#18075) @Matt711
- Add a slice expression to polars IR (#18050) @Matt711
- Expose `num_rows_per_source` (IO metadata) to pylibcudf (#18049) @Matt711
- Added Imbalanced Tree Benchmarks for Transforms (#18032) @lamarrr
- Run the narwhals test suite with cudf.pandas (#18031) @Matt711
- Add `host_read_async` interfaces to `datasource` (#18018) @vuule
- Make most cudf-polars `Node` objects pickleable (#17998) @rjzamora
- Add `Column.serialize` to cudf-polars (#17990) @rjzamora
- Bump polars version to <1.23 (#17986) @Matt711
- Implemented Decimal Transforms (#17968) @lamarrr
- Introduce ZSTD host-side compression and decompression APIs (#17935) @shrshi
- Add catboost integration tests (#17931) @Matt711
- [FEA] Expose `stripe_size_rows` setting for `ORCWriterOptions` (#17927) @ustcfy
- Test narwhals in CI (#17884) @bdice
- Added Multi-input & Scalar Support for Transform UDFs (#17881) @lamarrr
- Host Snappy compression (#17824) @vuule
- Run spark-rapids-jni CI (#17781) @KyleFromNVIDIA
- Add multi-partition `Shuffle` operation to cuDF Polars (#17744) @rjzamora
- Added polynomials benchmark (#17695) @lamarrr
- Add stream parameters in pylibcudf IO APIs (#17620) @Matt711
- New nvtext::wordpiece_tokenizer APIs (#17600) @davidwendt
- Add support for unary negation operator (#17560) @Matt711
- Add multi-partition `Join` support to cuDF-Polars (#17518) @rjzamora
- Add basic multi-partition `GroupBy` support to cuDF-Polars (#17503) @rjzamora
- Support Distributed in cudf-polars tests and IR evaluation (#17364) @pentschev
🛠️ Improvements
- Use pyarrow 15 in oldest dependency CI jobs (#18409) @bdice
- Bump librdkafka to 2.8.0 (#18370) @raydouglass
- fix(rattler): ignore `libzlib` run dependency to avoid `pandoc` collision (#18368) @gforsyth
- Fix zstd build interface include definition (#18366) @trxcllnt
- test: Install pytest-env and hypothesis in test_narwhals.sh (#18337) @MarcoGorelli
- Remove unused `group_range_rolling_window` API (#18313) @wence-
- Cache column view creation from arrow types (#18302) @vyasr
- Split Narwhals cudf.pandas tests failures into to fix and to skip (#18267) @mroeschke
- Support BinOp, min, and max Aggregations in cudf-polars parallel groupby (#18266) @TomAugspurger
- Minor clean up and optimizations in the Parquet writer (#18258) @vuule
- Fix cudf_kafka run export for cudatoolkit (#18245) @gforsyth
- dask-polars: use splat everywhere. (#18243) @madsbk
- Remove cudf.Scalar from binops (#18240) @mroeschke
- Remove warning in the stream pool when asking for more streams than available (#18236) @vuule
- Explain why we disable parallelism for profiler tests to avoid pytest-cov issue (#18234) @Matt711
- Ignore cudatoolkit run exports by name, not package (#18230) @gforsyth
- Revert "Bump nightly check limit" (#18227) @Matt711
- Fix cudf.pandas to be able to work on a cpu-only machine (#18224) @galipremsagar
- Add missing cudatoolkit run_export ignore to pylibcudf (#18223) @gforsyth
- Remove cudf.Scalar from Column.setitem (#18221) @mroeschke
- Remove unused rounduppow2 utility (#18218) @PointKernel
- Add flake8-print/debugger Ruff rules (#18217) @mroeschke
- Bump polars version to <1.25 (#18209) @Matt711
- Export RAPIDS_ARTIFACTS_DIR. (#18208) @bdice
- Drop more thrust functions with libcu++ ones (#18207) @miscco
- Update Numpy <2.1 unpinning xfail condition (#18203) @mroeschke
- Run conda import tests on Python packages (#18197) @bdice
- fix(rattler): add cudatoolkit ignore run export to cudf (#18195) @gforsyth
- Revert "Disable ARM CI in C++ and Python test CI jobs" (#18188) @Matt711
- Define Column.where to be used across DataFrame/Series (#18186) @mroeschke
- Remove cudf.Scalar in where (#18178) @mroeschke
- Drop unnecessary fmt dep (#18177) @vyasr
- Refactor join internals: separate hash_join declaration and cleanup (#18170) @PointKernel
- Add Ruff rule to enforce cudf dtype utils over numpy/pandas dtype utils (#18169) @mroeschke
- Combine multiple str.minhash() APIs into one call (#18168) @davidwendt
- Move nanoarrow_utils.hpp from cpp/tests/interop to cpp/include/cudf_test (#18163) @davidwendt
- Test cudf against the latest stable branch of Narwhals (#18162) @Matt711
- fix libcudf pins cu11 (#18161) @gforsyth
- Combine separate ConfigureNVBench calls to fix cpp conda builds (#18155) @gforsyth
- Add telemetry to build workflows (#18154) @gforsyth
- Prune more seldom used dtype utils (#18150) @mroeschke
- Remove some unnecessary module imports (#18143) @mroeschke
- Branch 25.04 merge branch 25.02 (#18142) @vyasr
- Prune some seldom used dtype utils (#18141) @mroeschke
- Use more, cheaper dtype checking utilities in cudf Python (#18139) @mroeschke
- Support deserializing cudf-polars objects composed of RMM frames (#18138) @pentschev
- Add ConfigOptions convenience class to cudf-polars (#18137) @rjzamora
- Support new callback API for lazyframe.profile (#18132) @wence-
- Optimized compilation of CUDFTESTUTIL's interface sources (#18131) @lamarrr
- Unpin numpy<2.1 (#18128) @mroeschke
- Use cpu16 for build CI jobs (#18124) @bdice
- Remove now non-existent job (#18123) @vyasr
- Minor typo fix in filling.pxd (#18120) @davidwendt
- Replace more deprecated CUB functors (#18119) @miscco
- Simplify DecimalDtype and DecimalColumn operations (#18111) @mroeschke
- Add interop support from arrow StringView to libcudf strings column (#18107) @davidwendt
- Expose the Number of Filtered Parquet Rowgroups (IO Metadata) to pylibcudf (#18106) @JigaoLuo
- Add a list of expected failures to narwhals tests (#18097) @Matt711
- Remove unused var (#18096) @vyasr
- Run narwhals tests nightly. (#18093) @bdice
- Use conda-build instead of conda-mambabuild (#18092) @bdice
- Remove static configure step (#18091) @vyasr
- Remove FindCUDAToolkit.cmake from .pre-commit-config.yaml (#18087) @KyleFromNVIDIA
- Align StringColumn constructor with ColumnBase base class (#18086) @mroeschke
- Remove FindCUDAToolkit backport (#18081) @KyleFromNVIDIA
- Support melt(ignore_index=False) (#18080) @mroeschke
- Update numba dep and upper-bound numpy (#18078) @vyasr
- Add as_proxy_object API to cudf.pandas (#18072) @galipremsagar
- Enforce deprecation of dtype parameter in sum/product (#18070) @mroeschke
- send sccache logs to telemetry (#18069) @msarahan
- Short circuit Index.equal if compared Index isn't same type (#18067) @mroeschke
- Make Column.view/can_cast_safely accept a dtype object (#18066) @mroeschke
- Optimization improvement for substr in cudf::string_view (#18062) @davidwendt
- Forward-merge branch-25.02 to branch-25.04 (#18061) @bdice
- Port all conda recipes to rattler-build (#18054) @gforsyth
- Minor improvements in arrow interop (#18053) @wence-
- Pass more dtype objects to astype calls (#18044) @mroeschke
- Forward merge branch-25.02 to branch-25.04 (#18041) @Matt711
- Replace deprecated CCCL features (#18036) @miscco
- Separate stats filtering helpers to reuse in page pruning (#18034) @mhaseeb123
- Update spark-rapids-jni CI image version to cuda12.8.0 (#18024) @pxLi
- Add pylibcudf.Scalar.from_numpy for bool/int/float/str types (#18020) @mroeschke
- Support IntervalDtype(subtype=None) (#18017) @mroeschke
- Enable pytest-xdist runs for py-polars tests (#18016) @galipremsagar
- consolidate more conda solves in CI (#18014) @jameslamb
- Replace cub::Int2Type with cuda::std::integral_constant (#18013) @miscco
- Remove deprecated single component datetime extract APIs (#18010) @Matt711
- Pass dtype objects to Column.astype (#18008) @mroeschke
- Require CMake 3.30.4 (#18007) @robertmaynard
- Refactor math_ops.cu dispatcher logic (#18006) @davidwendt
- Move cudf::lists::detail::make_empty_lists_column to public API (#17996) @davidwendt
- Create Conda CI test env in one step (#17995) @KyleFromNVIDIA
- Add seed parameter to cudf hash_character_ngrams (#17994) @davidwendt
- Remove deprecated rolling window functionality (#17993) @wence-
- Continue on failures in cudf.pandas integration tests CI job (#17987) @Matt711
- Avoid cudf.dtype calls in build_column/column_empty/.where (#17979) @mroeschke
- Ensure dtype objects are passed within Column.astype (#17978) @mroeschke
- Use Conda XGBoost (#17959) @jakirkham
- Read the footers in parallel when reading multiple Parquet files (#17957) @vuule
- Refactor predicate pushdown to reuse row group pruning in experimental PQ reader (#17946) @mhaseeb123
- Add new nvtext tokenized minhash API (#17944) @davidwendt
- Use shared-workflows branch-25.04 (#17943) @bdice
- Get rid of the deprecated thrust::identity (#17942) @PointKernel
- Remove deprecated nvtext::minhash_permuted APIs (#17939) @davidwendt
- Enable third party library integration tests in CI with cudf.pandas (#17936) @galipremsagar
- Add build_type input field for test.yaml (#17925) @gforsyth
- Remove cudf.Scalar from shift/fillna (#17922) @mroeschke
- Enabling cross join in cudf python (#17921) @galipremsagar
- Use rapids-pip-retry in CI jobs that might need retries (#17920) @gforsyth
- More avoid cudf.dtype internally in favor of pre-defined, supported types (#17918) @mroeschke
- Initialize inout parameter (#17911) @miscco
- Remove dataframe protocol (#17909) @vyasr
- Rename PascalCase functions and types to snake_case to improve consistency (#17908) @vuule
- Use new rapids-logger library (#17899) @vyasr
- Add pylibcudf.Scalar.from_py for construction from Python strings, bool, int, float (#17898) @mroeschke
- Remove cudf.Scalar from factorize (#17897) @mroeschke
- disallow fallback to Make in Python builds (#17894) @jameslamb
- Remove orc::gpu namespace (#17891) @vuule
- Only run Auto Assign PR workflow if PR is not merged (#17888) @mroeschke
- Update pre-commit-hooks to version 0.6.0 (#17887) @KyleFromNVIDIA
- Forward-merge branch-25.02 to branch-25.04 (#17885) @bdice
- Add script to run pylibcudf tests (#17882) @bdice
- Migrate to NVKS for amd64 CI runners (#17877) @bdice
- Fix merge conflict for branch-25.02 into branch-25.04 (#17874) @davidwendt
- Remove decimal32/64 to decimal128 conversion in Parquet writer (#17869) @mhaseeb123
- Expose JSON reader options to builder in pylibcudf (#17866) @shrshi
- Remove cudf.Scalar from .dt timedelta properties (#17863) @mroeschke
- Added support for custom types in PTX parser (#17861) @lamarrr
- Remove cudf.Scalar from date_range/to_datetime (#17860) @mroeschke
- Avoid cudf.dtype internally in favor of pre-defined, supported types (#17839) @mroeschke
- Allow cudf::type_to_id<T const>() (#17831) @esoha-nvidia
- Fixing auto-merge branch-25.02 into branch-25.04 (#17828) @davidwendt
- Add new nvtext::normalize_characters API (#17818) @davidwendt
- Include more information in error messages in the nvcomp adapter (#17814) @vuule
- Extend and simplify API for calculation of range-based rolling window offsets (#17807) @wence-
- More minor fixes for CCCL (#17793) @miscco
- Use KvikIO to enable file's fast host read and host write (#17764) @kingcrimsontianyu
- Remove cudf._lib.column in favor of pylibcudf. (#17760) @mroeschke
- Replaced std::string with std::string_view and removed excessive copies in cudf::io (#17734) @lamarrr
- Use xdist worksteal on the cudf.pandas test suite (#16930) @Matt711
- C++
Published by AyodeAwe 11 months ago
https://github.com/rapidsai/cudf - [NIGHTLY] v25.06.00
🚨 Breaking Changes
- Promote Parquet type enums to enum classes (#18441) @mhaseeb123
- Move parquet schema types and structs to public headers (#18424) @mhaseeb123
- Start removal of vector factories with _sync suffix by deprecating them and adding versions without the suffix (#18414) @vuule
- Skip decoding of pages marked as pruned in PQ reader (#18347) @mhaseeb123
- Deprecate nvtext subword tokenizer (#18334) @davidwendt
- Add standard data ingestion pipelines to pylibcudf for ndarrays (#18311) @Matt711
- Add Keep Option Parameter to Distinct (#18237) @warrickhe
🐛 Bug Fixes
- Fix cpp examples cmake to use the rapids_config.cmake (#18501) @davidwendt
- Rename rapidsmp to rapidsmpf (#18493) @rjzamora
- Fix compilation with the C++20 standard (#18486) @vuule
- Fix an error when reading some compressed Parquet V2 files (#18478) @vuule
- Ensure DataFrame column label operations reset label_dtype (#18452) @mroeschke
- Fix a segfault when reading a Parquet file with unsupported compression type (#18451) @vuule
- Fix logger macros (#18444) @vyasr
- Use delete not free to release data allocated with new (#18412) @wence-
- Fix synchronization issues in host compression and decompression (#18395) @vuule
- Update Dask array-conversion handling (#18382) @rjzamora
- Fixed indexing on empty DataFrame with no columns (#18381) @TomAugspurger
- Deterministic hashing for DataFrameScan nodes in cudf-polars multi-partition executor (#18351) @TomAugspurger
- Fix index of right table in unary operators in AST, in Joins (#18333) @karthikeyann
- Add offsetalator to contiguous-split (#18312) @davidwendt
- Support large strings in nvtext vocabulary-tokenizer (#18283) @davidwendt
📖 Documentation
- [DOC] Improve clarity in parquet APIs set_row_groups and set_columns parquet (#18466) @Matt711
- Add a usage page to cudf-polars documentation (#18460) @Matt711
- [DOC] Fix typo in CONTRIBUTING.md on build type tests (#18456) @JigaoLuo
- Add restart kernel note in cudf pandas docs (#18374) @ncclementi
🚀 New Features
- Support reading from device buffers in the pylibcudf IO APIs (#18496) @Matt711
- Move parquet schema types and structs to public headers (#18424) @mhaseeb123
- Add optional dtype argument to Scalar.from_any (#18415) @Matt711
- Expose cudf::chunked_pack in pylibcudf (#18411) @wence-
- Add support for long string columns in cudf::contiguous_split (#18393) @nvdbaranec
- Automatically dispatch between host and device decompression/compression based on the number of buffers (#18363) @vuule
- Skip decoding of pages marked as pruned in PQ reader (#18347) @mhaseeb123
- Support constructing pylibcudf Columns and Tables from views into arbitrary objects (#18314) @vyasr
- Add standard data ingestion pipelines to pylibcudf for ndarrays (#18311) @Matt711
- Support cudf-polars isoyear and week (isoweek) (#18265) @brandon-b-miller
- Add Keep Option Parameter to Distinct (#18237) @warrickhe
- Add rapidsmp shuffle support to cudf-polars (#18231) @rjzamora
- Support cudf-polars strftime (#18181) @brandon-b-miller
- Support include_file_paths in cudf polars (#18057) @Matt711
🛠️ Improvements
- Optimize pandas metadata generation to reduce memory pressure (#18505) @galipremsagar
- Add pylibcudf.Column.from_rmm_buffer (#18502) @mroeschke
- Replace thrust functors with libcu++ ones (#18500) @miscco
- Rename cudf-polars executors (#18499) @rjzamora
- Remove casting functions in pylibcudf utils (#18497) @Matt711
- Increase wheel size limit. (#18487) @bdice
- Split join header (#18484) @shrshi
- Fix unspecified behavior involving move semantics and order of evaluation (#18481) @kingcrimsontianyu
- Rerun flaky pytests in CI (#18476) @galipremsagar
- Vendor RAPIDS.cmake (#18473) @bdice
- Add ARM conda environments. (#18470) @bdice
- Bump polars version to <1.28 (#18469) @Matt711
- Promote Parquet type enums to enum classes (#18441) @mhaseeb123
- Update compression formats supported in JSON reader (#18438) @shrshi
- Disabled Jitify Minification (#18436) @lamarrr
- Replace direct use of nvCOMP and of its adapter with the higher-level decompression API (#18434) @vuule
- Test against stable tags for narwhals (#18431) @Matt711
- Refcount-based dropping of cached evaluations in cudf-polars executor (#18430) @wence-
- Replace Thrust iterator facilities with libcu++ ones (#18427) @miscco
- Remove numpy requirement when converting 2d cuda array interface objects to pylibcudf Columns (#18426) @Matt711
- Switch the ptr type in gpumemoryview from Py_ssize_t to uintptr_t (#18419) @Matt711
- Add strings::extract_single API (#18417) @davidwendt
- Start removal of vector factories with _sync suffix by deprecating them and adding versions without the suffix (#18414) @vuule
- Allow polars arrow conversion to produce string_view (#18413) @wence-
- Add rank and label_bin methods to ColumnBase (#18407) @mroeschke
- Automatic single-partition fallback in cudf-polars (#18405) @rjzamora
- Remove _sync suffix from hostdevice types (#18404) @vuule
- Use owning Arrow types in C++ to expose data to Python (#18402) @vyasr
- add static push and pop methods to NvtxRange (#18401) @zpuller
- Deprecate cudf.Scalar (#18394) @mroeschke
- Bump polars version to <1.27 (#18387) @Matt711
- Branch 25.06 merge 25.04 (#18380) @Matt711
- Silence warning by setting BUILD_SHARED_LIBS (#18371) @vyasr
- Pass stream through when taking ownership from libcudf (#18367) @wence-
- Avoid patching sort algorithms from CCCL (#18364) @miscco
- Deprecate old nvtext::normalize_characters (#18360) @davidwendt
- refactor(rattler): enable strict channel priority for builds (#18358) @gforsyth
- Optimize sequences by introducing make_offsets_child_column (#18357) @ustcfy
- Decompress all data in a single decompress_page_data when reading Parquet input in a single chunk (#18352) @vuule
- Performance improvement for tolower/toupper for multi-byte UTF-8 characters (#18345) @davidwendt
- Branch 25.06 merge branch 25.04 (#18344) @vyasr
- Use dask-cuda for cudf-polars experimental testing (#18343) @rjzamora
- Deprecate nvtext subword tokenizer (#18334) @davidwendt
- Remove cudf.Scalar in as_column (#18331) @mroeschke
- Allow cudf.DataFrame.from_pylibcudf to accept a pylibcudf.io.TableWithMetadata (#18319) @mroeschke
- Avoid stateful construction in DataFrame.__init__ (#18306) @mroeschke
- Improve the groupby performance for extremely low cardinality (#18290) @PointKernel
- Require type annotations in cudf.polars (#18285) @TomAugspurger
- Removing unnecessary StreamSynchronization in reading (#18279) @JigaoLuo
- Use the mapped buffer for all read operations in the memory-mapped source; switch default source to the kvikIO one (#18204) @vuule
- Improve test coverage in the catboost integration tests (#18126) @Matt711
- Create file sources in parallel (#18094) @vuule
- C++
Published by rapids-bot[bot] 11 months ago
https://github.com/rapidsai/cudf - v25.02.02
🚨 Breaking Changes
- Expose stream-ordering in scalar and avro APIs (#17766) @shrshi
- Add seed parameter to hash_character_ngrams (#17643) @davidwendt
- Performance improvements and simplifications for fixed size row-based rolling windows (#17623) @wence-
- Refactor distinct hash join to handle multiple probes with the same build table (#17609) @PointKernel
- Deprecate cudf::grouped_time_range_rolling_window (#17589) @wence-
- Remove "legacy" Dask DataFrame support from Dask cuDF (#17558) @rjzamora
- Return empty result for segmented_reduce if input and offsets are both empty (#17437) @davidwendt
- Rework minhash APIs for deprecation cycle (#17421) @davidwendt
- Change indices for dictionary column to signed integer type (#17390) @davidwendt
🐛 Bug Fixes
- Use protocol for dlpack instead of deprecated function (#18134) @vyasr
- Skip the failing connectorx polars tests (#18037) @Matt711
- Fix 'Unexpected short subpass' exception in parquet chunked reader. (#18019) @nvdbaranec
- Fix race check failures in shared memory groupby (#17985) @PointKernel
- Pin ibis version in the cudf.pandas integration tests <10.0.0 (#17975) @Matt711
- Fix the index type in the indexing operator of the span types (#17971) @vuule
- Add missing pin (#17915) @vyasr
- Fix third-party cudf.pandas tests (#17900) @galipremsagar
- Fix numpy data access by making attribute private (#17890) @galipremsagar
- Remove extra local var declaration from cudf.pandas 3rd-party integration shell script (#17886) @Matt711
- Move isinstance_cudf_pandas to fast_slow_proxy (#17875) @galipremsagar
- Make _Series_dtype method a property (#17854) @Matt711
- Fix the bug in determining the heuristics for shared memory groupby (#17851) @PointKernel
- Fix possible OOB mem access in Parquet decoder (#17841) @mhaseeb123
- Require batches to be non-empty in multi-batch JSON reader (#17837) @shrshi
- Fix rolling(min_periods=) with int and null data with mode.pandas_compat (#17822) @mroeschke
- Resolve race-condition in disable_module_accelerator (#17811) @galipremsagar
- Make Series(dtype=object) raise in mode.pandas_compat with non string data (#17804) @mroeschke
- Disable intended disabled ORC tests (#17790) @davidwendt
- Fix empty DataFrame construction not returning RangeIndex columns (#17784) @mroeschke
- Fix various .str methods for pandas compatibility (#17782) @mroeschke
- Fix count API issue about ignoring nan values (#17779) @galipremsagar
- Add numba pinning to cudf repo (#17777) @galipremsagar
- Allow .sort_values(na_position=) to include NaNs in mode.pandas_compatible (#17776) @mroeschke
- allow deselecting nvcomp wheels (#17774) @jameslamb
- Use the aligned_resource_adaptor to allocate bloom filter device buffers (#17758) @mhaseeb123
- Avoid instantiating bloom filter query function for nested and bool types (#17753) @mhaseeb123
- Fix DataFrame.merge(Series, how="left"/"right") on column and index not resulting in a RangeIndex (#17739) @mroeschke
- [BUG] xfail Polars excel test (#17731) @Matt711
- Require to implement AutoCloseable for the classes derived from HostUDFWrapper (#17727) @ttnghia
- Remove jlowe as a java committer since he retired (#17725) @tgravescs
- Prevent use of invalid grid sizes in ORC reader and writer (#17709) @vuule
- Enforce schema for partial tables in multi-source multi-batch JSON reader (#17708) @shrshi
- Compute and use the initial string offset when building nested large string cols with chunked parquet reader (#17702) @mhaseeb123
- Fix writing of compressed ORC files with large stripe footers (#17700) @vuule
- Fix cudf.polars sum of empty not equalling zero (#17685) @mroeschke
- Fix formatting in logging (#17680) @vuule
- convert all nulls to nans in a specific scenario (#17677) @galipremsagar
- Define cudf repr methods on the Column (#17675) @mroeschke
- Fix groupby.len with null values in cudf.polars (#17671) @mroeschke
- Fix: DataFrameGroupBy.get_group was raising with length>1 tuples (#17653) @MarcoGorelli
- Fix possible int overflow in compute_mixed_join_output_size (#17633) @davidwendt
- Fix a minor potential i32 overflow in thrust::transform_exclusive_scan in PQ reader preprocessing (#17617) @mhaseeb123
- Fix failing xgboost test in the cudf.pandas third-party integration tests (#17616) @Matt711
- Fix dask_cudf.read_csv (#17612) @rjzamora
- Fix memcheck error in ReplaceTest.NormalizeNansAndZerosMutable gtest (#17610) @davidwendt
- Correctly accept a pandas.CategoricalDtype(pandas.IntervalDtype(...), ...) type (#17604) @mroeschke
- Add ability to modify and propagate names of columns object (#17597) @galipremsagar
- Ignore NaN correctly in .quantile (#17593) @mroeschke
- Fix groupby argmin/max gather of sorted-order indices (#17591) @davidwendt
- Fix ctest fail running libcudf tests in a Debug build (#17576) @davidwendt
- Specify a version for rapids_logger dependency (#17573) @jlowe
- Fix the ORC decoding bug for the timestamp data (#17570) @kingcrimsontianyu
- [JNI] remove rmm argument to set rw access for fabric handles (#17553) @abellina
- Document undefined behavior in div_rounding_up_safe (#17542) @davidwendt
- Fix nvcc-imposed UB in constexpr functions (#17534) @vuule
- Add anonymous namespace to libcudf test source (#17529) @davidwendt
- Propagate failures in pandas integration tests and Skip failing tests (#17521) @Matt711
- Fix libcudf compile error when logging is disabled (#17512) @davidwendt
- Fix Dask-cuDF clip APIs (#17509) @rjzamora
- Fix pylibcudf to_arrow with multiple nested data types (#17504) @mroeschke
- Fix groupby(as_index=False).size not resetting index (#17499) @mroeschke
- Revert "Temporarily skip tests due to dask/distributed#8953" (#17492) @Matt711
- Workaround for a misaligned access in read_csv on some CUDA versions (#17477) @vuule
- Fix some possible thread-id overflow calculations (#17473) @davidwendt
- Temporarily skip tests due to dask/distributed#8953 (#17472) @wence-
- Detect mismatches in begin and end tokens returned by JSON tokenizer FST (#17471) @shrshi
- Support dask>=2024.11.2 in Dask cuDF (#17439) @rjzamora
- Fix write_json failure for zero columns in table/struct (#17414) @karthikeyann
- Fix Debug-mode failing Arrow test (#17405) @zeroshade
- Fix all null list column with missing child column in JSON reader (#17348) @karthikeyann
📖 Documentation
- Fix forward merge 24.12->25.02 (#18002) @raydouglass
- Fix incorrect example in pylibcudf docs (#17912) @Matt711
- Explicitly call out that the GPU open beta runs on a single GPU (#17872) @taureandyernv
- Update cudf.pandas colab link in docs (#17846) @taureandyernv
- [DOC] Make pylibcudf docs more visible (#17803) @Matt711
- Cross-link cudf.pandas profiler documentation. (#17668) @bdice
- Document interpreter install command for cudf.pandas (#17358) @bdice
- add comment to Series.tolist method (#17350) @tequilayu
🚀 New Features
- Bump polars version to <1.22 (#17771) @Matt711
- Make more constexpr available on device for cuIO (#17746) @PointKernel
- Add public interop functions between pylibcudf and cudf classic (#17730) @Matt711
- Support dask_expr migration into dask.dataframe (#17704) @rjzamora
- Make tests build without relaxed constexpr (#17691) @PointKernel
- Set default logger level to warn (#17684) @vyasr
- Support multithreaded reading of compressed buffers in JSON reader (#17670) @shrshi
- Control pinned memory use with environment variables (#17657) @vuule
- Host compression (#17656) @vuule
- Enable text build without relying on relaxed constexpr (#17647) @PointKernel
- Implement HOST_UDF aggregation for reduction and segmented reduction (#17645) @ttnghia
- Add JSON reader options structs to pylibcudf (#17614) @Matt711
- Refactor distinct hash join to handle multiple probes with the same build table (#17609) @PointKernel
- Add JSON Writer options classes to pylibcudf (#17606) @Matt711
- Add ORC reader options structs to pylibcudf (#17601) @Matt711
- Add Avro Reader options classes to pylibcudf (#17599) @Matt711
- Enable binaryop build without relying on relaxed constexpr (#17598) @PointKernel
- Measure the number of Parquet row groups filtered by predicate pushdown (#17594) @mhaseeb123
- Implement HOST_UDF aggregation for groupby (#17592) @ttnghia
- Plumb pylibcudf.io.parquet options classes through cudf python (#17506) @Matt711
- Add partition-wise Select support to cuDF-Polars (#17495) @rjzamora
- Add multi-partition Scan support to cuDF-Polars (#17494) @rjzamora
- Migrate cudf::io::merge_row_group_metadata to pylibcudf (#17491) @Matt711
- Add Parquet Reader options classes to pylibcudf (#17464) @Matt711
- Add multi-partition DataFrameScan support to cuDF-Polars (#17441) @rjzamora
- Return empty result for segmented_reduce if input and offsets are both empty (#17437) @davidwendt
- Abstract polars function expression nodes to ensure they are serializable (#17418) @pentschev
- Add CSV Reader options classes to pylibcudf (#17412) @Matt711
- Add support for pylibcudf.DataType serialization (#17352) @pentschev
- Enable rounding for Decimal32 and Decimal64 in cuDF (#17332) @a-hirota
- Remove upper bounds on cuda-python to allow 12.6.2 and 11.8.5 (#17326) @bdice
- Expose stream-ordering to groupby APIs (#17324) @shrshi
- Migrate ORC Writer to pylibcudf (#17310) @Matt711
- Support reading bloom filters from Parquet files and filter row groups using them (#17289) @mhaseeb123
🛠️ Improvements
- Update to nvcomp 4.2.0.11 (#18042) @bdice
- Remove pandas backend from cudf.pandas - ibis integration tests (#17945) @Matt711
- Revert CUDA 12.8 shared workflow branch changes (#17879) @vyasr
- Remove predicate param from DataFrameScan IR (#17852) @Matt711
- Remove cudf.Scalar from scatter APIs (#17847) @mroeschke
- Remove cudf.Scalar from interval_range (#17844) @mroeschke
- Add verify-codeowners hook (#17840) @KyleFromNVIDIA
- Build and test with CUDA 12.8.0 (#17834) @bdice
- Increase timeout for recently added test (#17829) @galipremsagar
- Apply ruff everywhere (notebooks and scripts) (#17820) @bdice
- Fix pre-commit.ci failures (#17819) @bdice
- Remove incorrect calls to set architectures (#17813) @vyasr
- Fix typo in exception raised when attempting to convert a string column to cupy (#17800) @dagardner-nv
- Add support for pyarrow-19 (#17794) @galipremsagar
- increase parallelism in nightly builds (#17792) @jameslamb
- Reduce libcudf memcheck tests output (#17791) @davidwendt
- Make cudf build with latest CCCL (#17788) @miscco
- Introduce some more rolling window benchmarks (#17787) @wence-
- Add shellcheck to pre-commit and fix warnings (#17778) @gforsyth
- Improve parquet reader very-long string performance (#17773) @pmattione-nvidia
- Update how to manage host UDF instance (#17770) @res-life
- Add getInts api for HostMemoryBuffer and UnsafeMemoryAccessor (#17767) @liurenjie1024
- Expose stream-ordering in scalar and avro APIs (#17766) @shrshi
- Standardize methods used from cudf.core._internals (#17765) @mroeschke
- Implement string join in cudf-polars (#17755) @wence-
- Deprecate dataframe protocol (#17736) @vyasr
- Add parquet reader long row test (#17735) @pmattione-nvidia
- Update kvikio call due to upstream changes (#17733) @kingcrimsontianyu
- Delay setting MultiIndex.level/codes until needed (#17728) @mroeschke
- Bounding pool size in multi-batch JSON reader (#17724) @shrshi
- Use GCC 13 in CUDA 12 conda builds. (#17721) @bdice
- Update minimal sphinx theme version so that we can use parallel doc builds (#17719) @vyasr
- Add more aggregation methods in pylibcudf (#17717) @mroeschke
- Make cudf._lib.strings_udf work with pylibcudf Columns instead of cudf._lib Columns (#17715) @mroeschke
- Add special orc test data: timestamp interspersed with null values (#17713) @kingcrimsontianyu
- Add pylibcudf.null_mask.null_count (#17711) @mroeschke
- Ensure pyarrow.Scalar to pylibcudf.Scalar is cached (#17707) @mroeschke
- Adapt cudf numba config for numba 0.61 removal (#17705) @mroeschke
- Remove cudf._lib.scalar in favor of pylibcudf (#17701) @mroeschke
- Fix parquet reader list bug (#17699) @pmattione-nvidia
- Migrated Dynamic AST Expression Trees in Benchmarks and Tests to use AST Tree (#17697) @lamarrr
- Skip polars test that can generate timezones that chrono_tz doesn't know (#17694) @wence-
- Use 64-bit offsets only if the current strings column output chunk size exceeds threshold (#17693) @mhaseeb123
- Use latest ci-conda images (#17690) @bdice
- Add multi-source reading to JSON reader benchmarks (#17688) @shrshi
- Convert cudf.Scalar usage to pylibcudf and pyarrow usage (#17686) @mroeschke
- remove find_package(Python) in libcudf build (#17683) @jameslamb
- Fix build metrics report format with long placehold filenames (#17679) @davidwendt
- Use rapids-cmake for the logger (#17674) @vyasr
- Java Parquet reads via multiple host buffers (#17673) @jlowe
- Remove cudf._libs.types.pyx (#17665) @mroeschke
- Add support for Groupby.cumprod (#17661) @galipremsagar
- Implement .dt.total_seconds (#17659) @galipremsagar
- Avoid shallow copies in groupby methods (#17646) @mroeschke
- Avoid double MultiIndex factorization in groupby index result (#17644) @mroeschke
- Add seed parameter to hash_character_ngrams (#17643) @davidwendt
- Fix possible overflow in WriteCoalescingCallbackWrapper::TearDown (#17642) @davidwendt
- Remove pragma GCC diagnostic from source files (#17637) @davidwendt
- Move unnecessary utilities from cudf._lib.scalar (#17636) @mroeschke
- Support compression= in DataFrame.to_json (#17634) @mroeschke
- Bump Polars version to <1.18 (#17632) @Matt711
- Add public APIs to Access Underlying cudf and pandas Objects from cudf.pandas Proxy Objects (#17629) @galipremsagar
- Use Numba Config to turn on Pynvjitlink Features (#17628) @isVoid
- Use PyNVML 12 (#17627) @jakirkham
- Remove cudf._lib.utils in favor of python APIs (#17625) @mroeschke
- Performance improvements and simplifications for fixed size row-based rolling windows (#17623) @wence-
- Fix return types for MurmurHash3_x86_32 template specializations (#17622) @davidwendt
- Clean up namespaces and improve compression-related headers (#17621) @vuule
- Use more pylibcudf.types instead of cudf._lib.types (#17619) @mroeschke
- Remove patch that is only needed for clang-tidy to run on test files (#17618) @vyasr
- update telemetry actions to fluent-bit friendly style (#17615) @msarahan
- Introduce some simple benchmarks for rolling window aggregations (#17613) @wence-
- Bump the oldest pyarrow version to 14.0.2 in test matrix (#17611) @galipremsagar
- Use [[nodiscard]] attribute before __device__ (#17608) @vuule
- Use host_vector in flatten_single_pass_aggs (#17605) @vuule
- Stop memory_resource.hpp from including itself (#17603) @vyasr
- Replace the outdated cuco window concept with buckets (#17602) @PointKernel
- Check if nightlies have succeeded recently enough (#17596) @vyasr
- Deprecate cudf::grouped_time_range_rolling_window (#17589) @wence-
- A couple of fixes in rapids-logger usage (#17588) @vyasr
- Simplify expression transformer in Parquet predicate pushdown with ast::tree (#17587) @mhaseeb123
- Remove unused functionality in cudf._lib.utils.pyx (#17586) @mroeschke
- Use cuda-python cuda.bindings import names. (#17585) @bdice
- Use no-sync copy for fixed-width types in cudf::concatenate (#17584) @davidwendt
- Remove cudf._lib.groupby in favor of inlining pylibcudf (#17582) @mroeschke
- Remove unused code of json schema in JSON reader (#17581) @karthikeyann
- Expose Scalar's constructor and Scalar#getScalarHandle() to public (#17580) @ttnghia
- Allow large strings in nvtext benchmarks (#17579) @davidwendt
- Remove cudf._lib.reduce in favor of inlining pylibcudf (#17574) @mroeschke
- Use batched memcpy when writing ORC statistics (#17572) @vuule
- Allow large strings in nvbench strings benchmarks (#17571) @davidwendt
- Update version references in workflow (#17568) @AyodeAwe
- Enable all json reader options in pylibcudf read_json (#17563) @karthikeyann
- Remove cudf._lib.parquet in favor of inlining pylibcudf (#17562) @mroeschke
- Fix CMake format in cudf/_lib/CMakeLists.txt (#17559) @mroeschke
- Remove "legacy" Dask DataFrame support from Dask cuDF (#17558) @rjzamora
- Replace direct cudaMemcpyAsync calls with utility functions (within /include) (#17557) @vuule
- Remove cudf._lib.interop in favor of inlining pylibcudf (#17555) @mroeschke
- gate telemetry dispatch calls on TELEMETRY_ENABLED env var (#17551) @msarahan
- Replace direct `cudaMemcpyAsync` calls with utility functions (within `/src`) (#17550) @vuule
- Remove unused `BufferArrayFromVector` (#17549) @Matt711
- Move cudf._lib.copying to cudf.core._internals (#17548) @mroeschke
- Update cuda-python lower bounds to 12.6.2 / 11.8.5 (#17547) @bdice
- Fix typos, rename types, and add null_probability benchmark axis for distinct (#17546) @PointKernel
- Mark more constexpr functions as device-available (#17545) @vyasr
- Use cooperative-groups instead of cub warp-reduce for strings contains (#17540) @davidwendt
- Remove cudf._lib.nvtext in favor of inlining pylibcudf (#17535) @mroeschke
- Add XXHash_32 hasher (#17533) @PointKernel
- Remove unused masked keyword in column_empty (#17530) @mroeschke
- Remove Thrust patch in favor of CMake definition for Thrust 32-bit offset types. (#17527) @bdice
- [JNI] Enables fabric handles for CUDA async memory pools (#17526) @abellina
- Force Thrust to use 32-bit offset type. (#17523) @bdice
- Replace cudf::detail::copy_if logic with thrust::copy_if and gather (#17520) @davidwendt
- Replaces uses of `cudf._lib.Column.from_unique_ptr` with `pylibcudf.Column.from_libcudf` (#17517) @Matt711
- Move cudf._lib.aggregation to cudf.core._internals (#17516) @mroeschke
- Migrate copy_column and Column.from_scalar to pylibcudf (#17513) @Matt711
- Remove cudf._lib.transform in favor of inlining pylibcudf (#17505) @mroeschke
- Remove cudf._lib.string.convert/split in favor of inlining pylibcudf (#17496) @mroeschke
- Move cudf._lib.sort to cudf.core._internals (#17488) @mroeschke
- Remove cudf._lib.csv in favor of inlining pylibcudf (#17485) @mroeschke
- Update PyTorch to >=2.4.0 to get fix for CUDA array interface bug, and drop CUDA 11 PyTorch tests. (#17475) @bdice
- Remove cudf._lib.binops in favor of inlining pylibcudf (#17468) @mroeschke
- Remove cudf._lib.orc in favor of inlining pylibcudf (#17466) @mroeschke
- skip most CI on devcontainer-only changes (#17465) @jameslamb
- Set build type for all examples (#17463) @vyasr
- Update the hook versions in pre-commit (#17462) @wence-
- Remove cudf._lib.string_casting in favor of inlining pylibcudf (#17460) @mroeschke
- Remove cudf._lib.filling in favor of inlining pylibcudf (#17459) @mroeschke
- Update MurmurHash3_x64_128 to use the cuco equivalent implementation (#17457) @PointKernel
- Move cudf._lib.stream_compaction to cudf.core._internals (#17456) @mroeschke
- Clean up xxhash_64 implementations (#17455) @PointKernel
- Update Hadoop dependency in Java pom (#17454) @jlowe
- Adapt to rmm logger changes (#17451) @vyasr
- Require approval to run CI on draft PRs (#17450) @bdice
- Expose stream-ordering in nvtext API (#17446) @shrshi
- Use exec_policy_nosync in write_json (#17445) @karthikeyann
- Remove cudf._lib.json in favor of inlining pylibcudf (#17443) @mroeschke
- Remove cudf._lib.null_mask in favor of inlining pylibcudf (#17440) @mroeschke
- Expose stream-ordering in replace API (#17436) @shrshi
- Expose stream-ordering in copying APIs (#17435) @shrshi
- Expose stream-ordering in column view APIs (#17434) @shrshi
- Apply clang-tidy autofixes from new rules (#17431) @vyasr
- Remove cudf._lib.round in favor of inlining pylibcudf (#17430) @mroeschke
- Update MurmurHash3_x86_32 to use the cuco equivalent implementation (#17429) @PointKernel
- Remove cudf._lib.replace in favor of inlining pylibcudf (#17428) @mroeschke
- Remove nvtx/ranges.hpp include from cuda.cuh (#17427) @davidwendt
- Remove the unused detail `int_fastdiv.h` header (#17426) @PointKernel
- Remove cudf._lib.lists in favor of inlining pylibcudf (#17425) @mroeschke
- Remove cudf._lib.quantile (#17424) @mroeschke
- Remove cudf._lib.rolling in favor of inlining pylibcudf (#17423) @mroeschke
- Avoid converting Decimal32/Decimal64 in `to_arrow` and `from_arrow` APIs (#17422) @zeroshade
- Rework minhash APIs for deprecation cycle (#17421) @davidwendt
- Use thread_index_type in binary-ops jit kernel.cu (#17420) @davidwendt
- Change binops for-each kernel to thrust::for_each_n (#17419) @davidwendt
- Move cudf._lib.search to cudf.core._internals (#17411) @mroeschke
- Use grid_1d utilities in copy_range.cuh (#17409) @davidwendt
- Remove cudf._lib.text in favor of inlining pylibcudf (#17408) @mroeschke
- Run clang-tidy checks in PR CI (#17407) @bdice
- Update strings/text source to use grid_1d for thread/block/stride calculations (#17404) @davidwendt
- Expose stream-ordering to strings attribute APIs (#17398) @shrshi
- Expose stream-ordering to interop APIs (#17397) @shrshi
- Remove unused type aliases (#17396) @PointKernel
- Remove some cudf._lib.strings files in favor of inlining pylibcudf (#17394) @mroeschke
- Update xxhash_64 to utilize the cuco equivalent implementation (#17393) @PointKernel
- Change indices for dictionary column to signed integer type (#17390) @davidwendt
- Return categorical values in to_numpy/to_cupy (#17388) @mroeschke
- Forward-merge branch-24.12 to branch-25.02 (#17379) @bdice
- Remove unused IO utilities from cudf python (#17374) @Matt711
- Remove cudf._lib.datetime in favor of inlining pylibcudf (#17372) @mroeschke
- Remove cudf._lib.join in favor of inlining pylibcudf (#17371) @mroeschke
- Remove cudf._lib.merge in favor of inlining pylibcudf (#17370) @mroeschke
- Remove cudf._lib.partitioning in favor of inlining pylibcudf (#17369) @mroeschke
- Remove cudf._lib.reshape in favor of inlining pylibcudf (#17368) @mroeschke
- Remove cudf._lib.timezone in favor of inlining pylibcudf (#17366) @mroeschke
- Remove cudf._lib.transpose in favor of inlining pylibcudf (#17365) @mroeschke
- Move make_strings_column benchmark to nvbench (#17340) @davidwendt
- Improve strings contains/find performance for smaller strings (#17330) @davidwendt
- Use rapids-logger to generate the cudf logger (#17307) @vyasr
- Mukernels strings (#17286) @pmattione-nvidia
- Add write_parquet to pylibcudf (#17263) @mroeschke
- Single-partition Dask executor for cuDF-Polars (#17262) @rjzamora
- Add breaking change workflow trigger (#17248) @AyodeAwe
- Precompute AST arity (#17234) @bdice
- Update to CCCL 2.7.0-rc2. (#17233) @bdice
- Make `column_empty` mask buffer creation consistent with libcudf (#16715) @mroeschke
Published by raydouglass 12 months ago
https://github.com/rapidsai/cudf - v25.02.01
🚨 Breaking Changes
- Expose stream-ordering in scalar and avro APIs (#17766) @shrshi
- Add seed parameter to hash_character_ngrams (#17643) @davidwendt
- Performance improvements and simplifications for fixed size row-based rolling windows (#17623) @wence-
- Refactor distinct hash join to handle multiple probes with the same build table (#17609) @PointKernel
- Deprecate cudf::grouped_time_range_rolling_window (#17589) @wence-
- Remove "legacy" Dask DataFrame support from Dask cuDF (#17558) @rjzamora
- Return empty result for segmented_reduce if input and offsets are both empty (#17437) @davidwendt
- Rework minhash APIs for deprecation cycle (#17421) @davidwendt
- Change indices for dictionary column to signed integer type (#17390) @davidwendt
🐛 Bug Fixes
- Skip the failing connectorx polars tests (#18037) @Matt711
- Fix 'Unexpected short subpass' exception in parquet chunked reader. (#18019) @nvdbaranec
- Fix race check failures in shared memory groupby (#17985) @PointKernel
- Pin `ibis` version in the cudf.pandas integration tests <10.0.0 (#17975) @Matt711
- Fix the index type in the indexing operator of the span types (#17971) @vuule
- Add missing pin (#17915) @vyasr
- Fix third-party `cudf.pandas` tests (#17900) @galipremsagar
- Fix `numpy` data access by making attribute private (#17890) @galipremsagar
- Remove extra local var declaration from cudf.pandas 3rd-party integration shell script (#17886) @Matt711
- Move `isinstance_cudf_pandas` to `fast_slow_proxy` (#17875) @galipremsagar
- Make `_Series_dtype` method a property (#17854) @Matt711
- Fix the bug in determining the heuristics for shared memory groupby (#17851) @PointKernel
- Fix possible OOB mem access in Parquet decoder (#17841) @mhaseeb123
- Require batches to be non-empty in multi-batch JSON reader (#17837) @shrshi
- Fix rolling(min_periods=) with int and null data with mode.pandas_compat (#17822) @mroeschke
- Resolve race-condition in `disable_module_accelerator` (#17811) @galipremsagar
- Make Series(dtype=object) raise in mode.pandas_compat with non string data (#17804) @mroeschke
- Disable intended disabled ORC tests (#17790) @davidwendt
- Fix empty DataFrame construction not returning RangeIndex columns (#17784) @mroeschke
- Fix various `.str` methods for pandas compatibility (#17782) @mroeschke
- Fix `count` API issue about ignoring nan values (#17779) @galipremsagar
- Add `numba` pinning to `cudf` repo (#17777) @galipremsagar
- Allow .sort_values(na_position=) to include NaNs in mode.pandas_compatible (#17776) @mroeschke
- allow deselecting nvcomp wheels (#17774) @jameslamb
- Use the `aligned_resource_adaptor` to allocate bloom filter device buffers (#17758) @mhaseeb123
- Avoid instantiating bloom filter query function for nested and bool types (#17753) @mhaseeb123
- Fix DataFrame.merge(Series, how="left"/"right") on column and index not resulting in a RangeIndex (#17739) @mroeschke
- [BUG] xfail Polars excel test (#17731) @Matt711
- Require to implement `AutoCloseable` for the classes derived from `HostUDFWrapper` (#17727) @ttnghia
- Remove jlowe as a java committer since he retired (#17725) @tgravescs
- Prevent use of invalid grid sizes in ORC reader and writer (#17709) @vuule
- Enforce schema for partial tables in multi-source multi-batch JSON reader (#17708) @shrshi
- Compute and use the initial string offset when building `nested` large string cols with chunked parquet reader (#17702) @mhaseeb123
- Fix writing of compressed ORC files with large stripe footers (#17700) @vuule
- Fix cudf.polars sum of empty not equalling zero (#17685) @mroeschke
- Fix formatting in logging (#17680) @vuule
- convert all nulls to nans in a specific scenario (#17677) @galipremsagar
- Define cudf repr methods on the Column (#17675) @mroeschke
- Fix groupby.len with null values in cudf.polars (#17671) @mroeschke
- Fix: DataFrameGroupBy.get_group was raising with length>1 tuples (#17653) @MarcoGorelli
- Fix possible int overflow in compute_mixed_join_output_size (#17633) @davidwendt
- Fix a minor potential i32 overflow in `thrust::transform_exclusive_scan` in PQ reader preprocessing (#17617) @mhaseeb123
- Fix failing xgboost test in the cudf.pandas third-party integration tests (#17616) @Matt711
- Fix `dask_cudf.read_csv` (#17612) @rjzamora
- Fix memcheck error in ReplaceTest.NormalizeNansAndZerosMutable gtest (#17610) @davidwendt
- Correctly accept a `pandas.CategoricalDtype(pandas.IntervalDtype(...), ...)` type (#17604) @mroeschke
- Add ability to modify and propagate `names` of `columns` object (#17597) @galipremsagar
- Ignore NaN correctly in .quantile (#17593) @mroeschke
- Fix groupby argmin/max gather of sorted-order indices (#17591) @davidwendt
- Fix ctest fail running libcudf tests in a Debug build (#17576) @davidwendt
- Specify a version for rapids_logger dependency (#17573) @jlowe
- Fix the ORC decoding bug for the timestamp data (#17570) @kingcrimsontianyu
- [JNI] remove rmm argument to set rw access for fabric handles (#17553) @abellina
- Document undefined behavior in div_rounding_up_safe (#17542) @davidwendt
- Fix nvcc-imposed UB in `constexpr` functions (#17534) @vuule
- Add anonymous namespace to libcudf test source (#17529) @davidwendt
- Propagate failures in pandas integration tests and Skip failing tests (#17521) @Matt711
- Fix libcudf compile error when logging is disabled (#17512) @davidwendt
- Fix Dask-cuDF `clip` APIs (#17509) @rjzamora
- Fix pylibcudf to_arrow with multiple nested data types (#17504) @mroeschke
- Fix groupby(as_index=False).size not resetting index (#17499) @mroeschke
- Revert "Temporarily skip tests due to dask/distributed#8953" (#17492) @Matt711
- Workaround for a misaligned access in `read_csv` on some CUDA versions (#17477) @vuule
- Fix some possible thread-id overflow calculations (#17473) @davidwendt
- Temporarily skip tests due to dask/distributed#8953 (#17472) @wence-
- Detect mismatches in begin and end tokens returned by JSON tokenizer FST (#17471) @shrshi
- Support dask>=2024.11.2 in Dask cuDF (#17439) @rjzamora
- Fix write_json failure for zero columns in table/struct (#17414) @karthikeyann
- Fix Debug-mode failing Arrow test (#17405) @zeroshade
- Fix all null list column with missing child column in JSON reader (#17348) @karthikeyann
📖 Documentation
- Fix forward merge 24.12->25.02 (#18002) @raydouglass
- Fix incorrect example in pylibcudf docs (#17912) @Matt711
- Explicitly call out that the GPU open beta runs on a single GPU (#17872) @taureandyernv
- Update cudf.pandas colab link in docs (#17846) @taureandyernv
- [DOC] Make pylibcudf docs more visible (#17803) @Matt711
- Cross-link cudf.pandas profiler documentation. (#17668) @bdice
- Document interpreter install command for cudf.pandas (#17358) @bdice
- add comment to Series.tolist method (#17350) @tequilayu
🚀 New Features
- Bump polars version to <1.22 (#17771) @Matt711
- Make more constexpr available on device for cuIO (#17746) @PointKernel
- Add public interop functions between pylibcudf and cudf classic (#17730) @Matt711
- Support `dask_expr` migration into `dask.dataframe` (#17704) @rjzamora
- Make tests build without relaxed constexpr (#17691) @PointKernel
- Set default logger level to warn (#17684) @vyasr
- Support multithreaded reading of compressed buffers in JSON reader (#17670) @shrshi
- Control pinned memory use with environment variables (#17657) @vuule
- Host compression (#17656) @vuule
- Enable text build without relying on relaxed constexpr (#17647) @PointKernel
- Implement `HOST_UDF` aggregation for reduction and segmented reduction (#17645) @ttnghia
- Add JSON reader options structs to pylibcudf (#17614) @Matt711
- Refactor distinct hash join to handle multiple probes with the same build table (#17609) @PointKernel
- Add JSON Writer options classes to pylibcudf (#17606) @Matt711
- Add ORC reader options structs to pylibcudf (#17601) @Matt711
- Add Avro Reader options classes to pylibcudf (#17599) @Matt711
- Enable binaryop build without relying on relaxed constexpr (#17598) @PointKernel
- Measure the number of Parquet row groups filtered by predicate pushdown (#17594) @mhaseeb123
- Implement `HOST_UDF` aggregation for groupby (#17592) @ttnghia
- Plumb pylibcudf.io.parquet options classes through cudf python (#17506) @Matt711
- Add partition-wise `Select` support to cuDF-Polars (#17495) @rjzamora
- Add multi-partition `Scan` support to cuDF-Polars (#17494) @rjzamora
- Migrate `cudf::io::merge_row_group_metadata` to pylibcudf (#17491) @Matt711
- Add Parquet Reader options classes to pylibcudf (#17464) @Matt711
- Add multi-partition `DataFrameScan` support to cuDF-Polars (#17441) @rjzamora
- Return empty result for segmented_reduce if input and offsets are both empty (#17437) @davidwendt
- Abstract polars function expression nodes to ensure they are serializable (#17418) @pentschev
- Add CSV Reader options classes to pylibcudf (#17412) @Matt711
- Add support for `pylibcudf.DataType` serialization (#17352) @pentschev
- Enable rounding for Decimal32 and Decimal64 in cuDF (#17332) @a-hirota
- Remove upper bounds on cuda-python to allow 12.6.2 and 11.8.5 (#17326) @bdice
- Expose stream-ordering to groupby APIs (#17324) @shrshi
- Migrate ORC Writer to pylibcudf (#17310) @Matt711
- Support reading bloom filters from Parquet files and filter row groups using them (#17289) @mhaseeb123
🛠️ Improvements
- Update to nvcomp 4.2.0.11 (#18042) @bdice
- Remove pandas backend from `cudf.pandas` - ibis integration tests (#17945) @Matt711
- Revert CUDA 12.8 shared workflow branch changes (#17879) @vyasr
- Remove predicate param from `DataFrameScan` IR (#17852) @Matt711
- Remove cudf.Scalar from scatter APIs (#17847) @mroeschke
- Remove cudf.Scalar from interval_range (#17844) @mroeschke
- Add `verify-codeowners` hook (#17840) @KyleFromNVIDIA
- Build and test with CUDA 12.8.0 (#17834) @bdice
- Increase timeout for recently added test (#17829) @galipremsagar
- Apply ruff everywhere (notebooks and scripts) (#17820) @bdice
- Fix pre-commit.ci failures (#17819) @bdice
- Remove incorrect calls to set architectures (#17813) @vyasr
- Fix typo in exception raised when attempting to convert a string column to cupy (#17800) @dagardner-nv
- Add support for `pyarrow-19` (#17794) @galipremsagar
- increase parallelism in nightly builds (#17792) @jameslamb
- Reduce libcudf memcheck tests output (#17791) @davidwendt
- Make cudf build with latest CCCL (#17788) @miscco
- Introduce some more rolling window benchmarks (#17787) @wence-
- Add shellcheck to pre-commit and fix warnings (#17778) @gforsyth
- Improve parquet reader very-long string performance (#17773) @pmattione-nvidia
- Update how to manage host UDF instance (#17770) @res-life
- Add getInts api for HostMemoryBuffer and UnsafeMemoryAccessor (#17767) @liurenjie1024
- Expose stream-ordering in scalar and avro APIs (#17766) @shrshi
- Standardize methods used from `cudf.core._internals` (#17765) @mroeschke
- Implement string join in cudf-polars (#17755) @wence-
- Deprecate dataframe protocol (#17736) @vyasr
- Add parquet reader long row test (#17735) @pmattione-nvidia
- Update kvikio call due to upstream changes (#17733) @kingcrimsontianyu
- Delay setting MultiIndex.level/codes until needed (#17728) @mroeschke
- Bounding pool size in multi-batch JSON reader (#17724) @shrshi
- Use GCC 13 in CUDA 12 conda builds. (#17721) @bdice
- Update minimal sphinx theme version so that we can use parallel doc builds (#17719) @vyasr
- Add more aggregation methods in pylibcudf (#17717) @mroeschke
- Make cudf._lib.string_udf work with pylibcudf Columns instead of cudf._lib Columns (#17715) @mroeschke
- Add special orc test data: timestamp interspersed with null values (#17713) @kingcrimsontianyu
- Add pylibcudf.null_mask.null_count (#17711) @mroeschke
- Ensure pyarrow.Scalar to pylibcudf.Scalar is cached (#17707) @mroeschke
- Adapt cudf numba config for numba 0.61 removal (#17705) @mroeschke
- Remove cudf._lib.scalar in favor of pylibcudf (#17701) @mroeschke
- Fix parquet reader list bug (#17699) @pmattione-nvidia
- Migrated Dynamic AST Expression Trees in Benchmarks and Tests to use AST Tree (#17697) @lamarrr
- Skip polars test that can generate timezones that chrono_tz doesn't know (#17694) @wence-
- Use 64-bit offsets only if the current strings column output chunk size exceeds threshold (#17693) @mhaseeb123
- Use latest ci-conda images (#17690) @bdice
- Add multi-source reading to JSON reader benchmarks (#17688) @shrshi
- Convert cudf.Scalar usage to pylibcudf and pyarrow usage (#17686) @mroeschke
- remove find_package(Python) in libcudf build (#17683) @jameslamb
- Fix build metrics report format with long placehold filenames (#17679) @davidwendt
- Use rapids-cmake for the logger (#17674) @vyasr
- Java Parquet reads via multiple host buffers (#17673) @jlowe
- Remove cudf._libs.types.pyx (#17665) @mroeschke
- Add support for `Groupby.cumprod` (#17661) @galipremsagar
- Implement `.dt.total_seconds` (#17659) @galipremsagar
- Avoid shallow copies in groupby methods (#17646) @mroeschke
- Avoid double MultiIndex factorization in groupby index result (#17644) @mroeschke
- Add seed parameter to hash_character_ngrams (#17643) @davidwendt
- Fix possible overflow in WriteCoalescingCallbackWrapper::TearDown (#17642) @davidwendt
- Remove pragma GCC diagnostic from source files (#17637) @davidwendt
- Move unnecessary utilities from cudf._lib.scalar (#17636) @mroeschke
- Support compression= in DataFrame.to_json (#17634) @mroeschke
- Bump Polars version to <1.18 (#17632) @Matt711
- Add public APIs to Access Underlying `cudf` and `pandas` Objects from `cudf.pandas` Proxy Objects (#17629) @galipremsagar
- Use Numba Config to turn on Pynvjitlink Features (#17628) @isVoid
- Use PyNVML 12 (#17627) @jakirkham
- Remove cudf._lib.utils in favor of python APIs (#17625) @mroeschke
- Performance improvements and simplifications for fixed size row-based rolling windows (#17623) @wence-
- Fix return types for MurmurHash3_x86_32 template specializations (#17622) @davidwendt
- Clean up namespaces and improve compression-related headers (#17621) @vuule
- Use more pylibcudf.types instead of cudf._lib.types (#17619) @mroeschke
- Remove patch that is only needed for clang-tidy to run on test files (#17618) @vyasr
- update telemetry actions to fluent-bit friendly style (#17615) @msarahan
- Introduce some simple benchmarks for rolling window aggregations (#17613) @wence-
- Bump the oldest `pyarrow` version to `14.0.2` in test matrix (#17611) @galipremsagar
- Use `[[nodiscard]]` attribute before `__device__` (#17608) @vuule
- Use `host_vector` in `flatten_single_pass_aggs` (#17605) @vuule
- Stop memory_resource.hpp from including itself (#17603) @vyasr
- Replace the outdated cuco window concept with buckets (#17602) @PointKernel
- Check if nightlies have succeeded recently enough (#17596) @vyasr
- Deprecate cudf::grouped_time_range_rolling_window (#17589) @wence-
- A couple of fixes in rapids-logger usage (#17588) @vyasr
- Simplify expression transformer in Parquet predicate pushdown with `ast::tree` (#17587) @mhaseeb123
- Remove unused functionality in cudf._lib.utils.pyx (#17586) @mroeschke
- Use cuda-python `cuda.bindings` import names. (#17585) @bdice
- Use no-sync copy for fixed-width types in cudf::concatenate (#17584) @davidwendt
- Remove cudf._lib.groupby in favor of inlining pylibcudf (#17582) @mroeschke
- Remove unused code of json schema in JSON reader (#17581) @karthikeyann
- Expose Scalar's constructor and `Scalar#getScalarHandle()` to public (#17580) @ttnghia
- Allow large strings in nvtext benchmarks (#17579) @davidwendt
- Remove cudf._lib.reduce in favor of inlining pylibcudf (#17574) @mroeschke
- Use batched memcpy when writing ORC statistics (#17572) @vuule
- Allow large strings in nvbench strings benchmarks (#17571) @davidwendt
- Update version references in workflow (#17568) @AyodeAwe
- Enable all json reader options in pylibcudf read_json (#17563) @karthikeyann
- Remove cudf._lib.parquet in favor of inlining pylibcudf (#17562) @mroeschke
- Fix CMake format in cudf/_lib/CMakeLists.txt (#17559) @mroeschke
- Remove "legacy" Dask DataFrame support from Dask cuDF (#17558) @rjzamora
- Replace direct `cudaMemcpyAsync` calls with utility functions (within `/include`) (#17557) @vuule
- Remove cudf._lib.interop in favor of inlining pylibcudf (#17555) @mroeschke
- gate telemetry dispatch calls on TELEMETRY_ENABLED env var (#17551) @msarahan
- Replace direct `cudaMemcpyAsync` calls with utility functions (within `/src`) (#17550) @vuule
- Remove unused `BufferArrayFromVector` (#17549) @Matt711
- Move cudf._lib.copying to cudf.core._internals (#17548) @mroeschke
- Update cuda-python lower bounds to 12.6.2 / 11.8.5 (#17547) @bdice
- Fix typos, rename types, and add null_probability benchmark axis for distinct (#17546) @PointKernel
- Mark more constexpr functions as device-available (#17545) @vyasr
- Use cooperative-groups instead of cub warp-reduce for strings contains (#17540) @davidwendt
- Remove cudf._lib.nvtext in favor of inlining pylibcudf (#17535) @mroeschke
- Add XXHash_32 hasher (#17533) @PointKernel
- Remove unused masked keyword in column_empty (#17530) @mroeschke
- Remove Thrust patch in favor of CMake definition for Thrust 32-bit offset types. (#17527) @bdice
- [JNI] Enables fabric handles for CUDA async memory pools (#17526) @abellina
- Force Thrust to use 32-bit offset type. (#17523) @bdice
- Replace cudf::detail::copy_if logic with thrust::copy_if and gather (#17520) @davidwendt
- Replaces uses of `cudf._lib.Column.from_unique_ptr` with `pylibcudf.Column.from_libcudf` (#17517) @Matt711
- Move cudf._lib.aggregation to cudf.core._internals (#17516) @mroeschke
- Migrate copy_column and Column.from_scalar to pylibcudf (#17513) @Matt711
- Remove cudf._lib.transform in favor of inlining pylibcudf (#17505) @mroeschke
- Remove cudf._lib.string.convert/split in favor of inlining pylibcudf (#17496) @mroeschke
- Move cudf._lib.sort to cudf.core._internals (#17488) @mroeschke
- Remove cudf._lib.csv in favor of inlining pylibcudf (#17485) @mroeschke
- Update PyTorch to >=2.4.0 to get fix for CUDA array interface bug, and drop CUDA 11 PyTorch tests. (#17475) @bdice
- Remove cudf._lib.binops in favor of inlining pylibcudf (#17468) @mroeschke
- Remove cudf._lib.orc in favor of inlining pylibcudf (#17466) @mroeschke
- skip most CI on devcontainer-only changes (#17465) @jameslamb
- Set build type for all examples (#17463) @vyasr
- Update the hook versions in pre-commit (#17462) @wence-
- Remove cudf._lib.string_casting in favor of inlining pylibcudf (#17460) @mroeschke
- Remove cudf._lib.filling in favor of inlining pylibcudf (#17459) @mroeschke
- Update MurmurHash3_x64_128 to use the cuco equivalent implementation (#17457) @PointKernel
- Move cudf._lib.stream_compaction to cudf.core._internals (#17456) @mroeschke
- Clean up xxhash_64 implementations (#17455) @PointKernel
- Update Hadoop dependency in Java pom (#17454) @jlowe
- Adapt to rmm logger changes (#17451) @vyasr
- Require approval to run CI on draft PRs (#17450) @bdice
- Expose stream-ordering in nvtext API (#17446) @shrshi
- Use exec_policy_nosync in write_json (#17445) @karthikeyann
- Remove cudf._lib.json in favor of inlining pylibcudf (#17443) @mroeschke
- Remove cudf._lib.null_mask in favor of inlining pylibcudf (#17440) @mroeschke
- Expose stream-ordering in replace API (#17436) @shrshi
- Expose stream-ordering in copying APIs (#17435) @shrshi
- Expose stream-ordering in column view APIs (#17434) @shrshi
- Apply clang-tidy autofixes from new rules (#17431) @vyasr
- Remove cudf._lib.round in favor of inlining pylibcudf (#17430) @mroeschke
- Update MurmurHash3_x86_32 to use the cuco equivalent implementation (#17429) @PointKernel
- Remove cudf._lib.replace in favor of inlining pylibcudf (#17428) @mroeschke
- Remove nvtx/ranges.hpp include from cuda.cuh (#17427) @davidwendt
- Remove the unused detail `int_fastdiv.h` header (#17426) @PointKernel
- Remove cudf._lib.lists in favor of inlining pylibcudf (#17425) @mroeschke
- Remove cudf._lib.quantile (#17424) @mroeschke
- Remove cudf._lib.rolling in favor of inlining pylibcudf (#17423) @mroeschke
- Avoid converting Decimal32/Decimal64 in `to_arrow` and `from_arrow` APIs (#17422) @zeroshade
- Rework minhash APIs for deprecation cycle (#17421) @davidwendt
- Use thread_index_type in binary-ops jit kernel.cu (#17420) @davidwendt
- Change binops for-each kernel to thrust::for_each_n (#17419) @davidwendt
- Move cudf._lib.search to cudf.core._internals (#17411) @mroeschke
- Use grid_1d utilities in copy_range.cuh (#17409) @davidwendt
- Remove cudf._lib.text in favor of inlining pylibcudf (#17408) @mroeschke
- Run clang-tidy checks in PR CI (#17407) @bdice
- Update strings/text source to use grid_1d for thread/block/stride calculations (#17404) @davidwendt
- Expose stream-ordering to strings attribute APIs (#17398) @shrshi
- Expose stream-ordering to interop APIs (#17397) @shrshi
- Remove unused type aliases (#17396) @PointKernel
- Remove some cudf._lib.strings files in favor of inlining pylibcudf (#17394) @mroeschke
- Update xxhash_64 to utilize the cuco equivalent implementation (#17393) @PointKernel
- Change indices for dictionary column to signed integer type (#17390) @davidwendt
- Return categorical values in to_numpy/to_cupy (#17388) @mroeschke
- Forward-merge branch-24.12 to branch-25.02 (#17379) @bdice
- Remove unused IO utilities from cudf python (#17374) @Matt711
- Remove cudf._lib.datetime in favor of inlining pylibcudf (#17372) @mroeschke
- Remove cudf._lib.join in favor of inlining pylibcudf (#17371) @mroeschke
- Remove cudf._lib.merge in favor of inlining pylibcudf (#17370) @mroeschke
- Remove cudf._lib.partitioning in favor of inlining pylibcudf (#17369) @mroeschke
- Remove cudf._lib.reshape in favor of inlining pylibcudf (#17368) @mroeschke
- Remove cudf._lib.timezone in favor of inlining pylibcudf (#17366) @mroeschke
- Remove cudf._lib.transpose in favor of inlining pylibcudf (#17365) @mroeschke
- Move make_strings_column benchmark to nvbench (#17340) @davidwendt
- Improve strings contains/find performance for smaller strings (#17330) @davidwendt
- Use rapids-logger to generate the cudf logger (#17307) @vyasr
- Mukernels strings (#17286) @pmattione-nvidia
- Add write_parquet to pylibcudf (#17263) @mroeschke
- Single-partition Dask executor for cuDF-Polars (#17262) @rjzamora
- Add breaking change workflow trigger (#17248) @AyodeAwe
- Precompute AST arity (#17234) @bdice
- Update to CCCL 2.7.0-rc2. (#17233) @bdice
- Make `column_empty` mask buffer creation consistent with libcudf (#16715) @mroeschke
Published by AyodeAwe 12 months ago
https://github.com/rapidsai/cudf - v24.12.00
🚨 Breaking Changes
- Fix reading Parquet string cols when `nrows` and `input_pass_limit` > 0 (#17321) @mhaseeb123
- prefer wheel-provided libcudf.so in load_library(), use RTLD_LOCAL (#17316) @jameslamb
- Deprecate single component extraction methods in libcudf (#17221) @Matt711
- Move detail header floating_conversion.hpp to detail subdirectory (#17209) @davidwendt
- Refactor Dask cuDF legacy code (#17205) @rjzamora
- Make HostMemoryBuffer call into the DefaultHostMemoryAllocator (#17204) @revans2
- Remove java reservation (#17189) @revans2
- Separate evaluation logic from `IR` objects in cudf-polars (#17175) @rjzamora
- Upgrade to polars 1.11 in cudf-polars (#17154) @wence-
- Remove the additional host register calls initially intended for performance improvement on Grace Hopper (#17092) @kingcrimsontianyu
- Correctly set `is_device_accesible` when creating `host_span`s from other container/span types (#17079) @vuule
- Unify treatment of `Expr` and `IR` nodes in cudf-polars DSL (#17016) @wence-
- Deprecate support for directly accessing logger (#16964) @vyasr
- Made cudftestutil header-only and removed GTest dependency (#16839) @lamarrr
🐛 Bug Fixes
- Turn off cudf.pandas 3rd party integrations tests for 24.12 (#17500) @Matt711
- Ignore errors when testing glibc versions (#17389) @vyasr
- Adapt to KvikIO API change in the compatibility mode (#17377) @kingcrimsontianyu
- Support pivot with index or column arguments as lists (#17373) @mroeschke
- Deselect failing polars tests (#17362) @pentschev
- Fix integer overflow in compiled binaryop (#17354) @wence-
- Update cmake to 3.28.6 in JNI Dockerfile (#17342) @jlowe
- fix library-loading issues in editable installs (#17338) @jameslamb
- Bug fix: restrict lines=True to JSON format in Kafka read_gdf method (#17333) @a-hirota
- Fix various issues with `replace` API and add support in `datetime` and `timedelta` columns (#17331) @galipremsagar
- Do not exclude nanoarrow and flatbuffers from installation if statically linked (#17322) @hyperbolic2346
- Fix reading Parquet string cols when `nrows` and `input_pass_limit` > 0 (#17321) @mhaseeb123
- Remove another reference to `FindcuFile` (#17315) @KyleFromNVIDIA
- Fix reading of single-row unterminated CSV files (#17305) @vuule
- Fixed lifetime issue in ast transform tests (#17292) @lamarrr
- Switch to using `TaskSpec` (#17285) @galipremsagar
- Fix data_type ctor call in JSON_TEST (#17273) @davidwendt
- Expose delimiter character in JSON reader options to JSON reader APIs (#17266) @shrshi
- Fix extract-datetime deprecation warning in ndsh benchmark (#17254) @davidwendt
- Disallow cuda-python 12.6.1 and 11.8.4 (#17253) @bdice
- Wrap custom iterator result (#17251) @galipremsagar
- Fix binop with LHS numpy datetimelike scalar (#17226) @mroeschke
- Fix `Dataframe.__setitem__` slow-downs (#17222) @galipremsagar
- Fix groupby.get_group with length-1 tuple with list-like grouper (#17216) @mroeschke
- Fix discoverability of submodules inside `pd.util` (#17215) @galipremsagar
- Fix `Schema.Builder` does not propagate precision value to `Builder` instance (#17214) @ttnghia
- Mark column chunks in a PQ reader `pass` as large strings when the cumulative `offsets` exceeds the large strings threshold. (#17207) @mhaseeb123
- [BUG] Replace `repo_token` with `github_token` in Auto Assign PR GHA (#17203) @Matt711
- Remove unsanitized nulls from input strings columns in reduction gtests (#17202) @davidwendt
- Fix `to_parquet` append behavior with global metadata file (#17198) @rjzamora
- Check `num_children() == 0` in `Column.from_column_view` (#17193) @cwharris
- Fix host-to-device copy missing sync in strings/duration convert (#17149) @davidwendt
- Add JNI Support for Multi-line Delimiters and Include Test (#17139) @SurajAralihalli
- Ignore loud dask warnings about legacy dataframe implementation (#17137) @galipremsagar
- Fix the GDS read/write segfault/bus error when the cuFile policy is set to GDS or ALWAYS (#17122) @kingcrimsontianyu
- Fix `DataFrame._from_arrays` and introduce validations (#17112) @galipremsagar
- [Bug] Fix Arrow-FS parquet reader for larger files (#17099) @rjzamora
- Fix bug in recovering invalid lines in JSONL inputs (#17098) @shrshi
- Reenable huge pages for arrow host copying (#17097) @vyasr
- Correctly set `is_device_accesible` when creating `host_span`s from other container/span types (#17079) @vuule
- Fix ORC reader when using `device_read_async` while the destination device buffers are not ready (#17074) @ttnghia
- Fix regex handling of fixed quantifier with 0 range (#17067) @davidwendt
- Limit the number of keys to calculate column sizes and page starts in PQ reader to 1B (#17059) @mhaseeb123
- Adding assertion to check for regular JSON inputs of size greater than `INT_MAX` bytes (#17057) @shrshi
- bug fix: use `self.ck_consumer` in `poll` method of kafka.py to align with `__init__` (#17044) @a-hirota
- Disable kvikio remote I/O to avoid openssl dependencies in JNI build (#17026) @pxLi
- Fix `host_span` constructor to correctly copy `is_device_accessible` (#17020) @vuule
- Add pinning for pyarrow in wheels (#17018) @vyasr
- Use std::optional for host types (#17015) @robertmaynard
- Fix write_json to handle empty string column (#16995) @karthikeyann
- Restore export of nvcomp outside of wheel builds (#16988) @KyleFromNVIDIA
- Allow melt(var_name=) to be a falsy label (#16981) @mroeschke
- Fix astype from tz-aware type to tz-aware type (#16980) @mroeschke
- Use `libcudf` wheel from PR rather than nightly for `polars-polars` CI test job (#16975) @brandon-b-miller
- Fix order-preservation in pandas-compat unsorted groupby (#16942) @wence-
- Fix cudf::strings::findall error with empty input (#16928) @davidwendt
- Fix JsonLargeReaderTest.MultiBatch use of LIBCUDF_JSON_BATCH_SIZE env var (#16927) @davidwendt
- Parse newline as whitespace character while tokenizing JSONL inputs with non-newline delimiter (#16923) @shrshi
- Respect groupby.nunique(dropna=False) (#16921) @mroeschke
- Update all rmm imports to use pylibrmm/librmm (#16913) @Matt711
- Fix order-preservation in cudf-polars groupby (#16907) @wence-
- Add a shortcut for when the input clusters are all empty for the tdigest merge (#16897) @jihoonson
- Properly handle the mapped and registered regions in `memory_mapped_source` (#16865) @vuule
- Fix performance regression for generate_character_ngrams (#16849) @davidwendt
- Fix regex parsing logic handling of nested quantifiers (#16798) @davidwendt
- Compute whole column variance using numerically stable approach (#16448) @wence-
📖 Documentation
- Add documentation for low memory readers (#17314) @btepera
- Fix the example in documentation for `get_dremel_data()` (#17242) @mhaseeb123
- Fix some documentation rendering for pylibcudf (#17217) @mroeschke
- Move detail header floating_conversion.hpp to detail subdirectory (#17209) @davidwendt
- Add TokenizeVocabulary to api docs (#17208) @davidwendt
- Add jaccard_index to generated cuDF docs (#17199) @davidwendt
- [no ci] Add empty-columns section to the libcudf developer guide (#17183) @davidwendt
- Add 2-cpp approvers text to contributing guide no ci @davidwendt
- Changing developer guide int_64_t to int64_t (#17130) @hyperbolic2346
- docs: change 'CSV' to 'csv' in python/custreamz/README.md to match kafka.py (#17041) @a-hirota
- [DOC] Document limitation using `cudf.pandas` proxy arrays (#16955) @Matt711
- [DOC] Document environment variable for failing on fallback in `cudf.pandas` (#16932) @Matt711
🚀 New Features
- Add version config (#17312) @vyasr
- Java JNI for Multiple contains (#17281) @res-life
- Add `cudf::calendrical_month_sequence` to pylibcudf (#17277) @Matt711
- Raise errors on specific types of fallback in `cudf.pandas` (#17268) @Matt711
- Add `catboost` to the third-party integration tests (#17267) @Matt711
- Add type stubs for pylibcudf (#17258) @wence-
- Use pylibcudf contiguous split APIs in cudf python (#17246) @Matt711
- Upgrade nvcomp to 4.1.0.6 (#17201) @bdice
- Added Arrow Interop Benchmarks (#17194) @lamarrr
- Rewrite Java API `Table.readJSON` to return the output from libcudf `read_json` directly (#17180) @ttnghia
- Support storing `precision` of decimal types in `Schema` class (#17176) @ttnghia
- Migrate CSV writer to pylibcudf (#17163) @Matt711
- Add compute_shared_memory_aggs used by shared memory groupby (#17162) @PointKernel
- Added ast tree to simplify expression lifetime management (#17156) @lamarrr
- Add compute_mapping_indices used by shared memory groupby (#17147) @PointKernel
- Add remaining datetime APIs to pylibcudf (#17143) @Matt711
- Added strings AST vs BINARY_OP benchmarks (#17128) @lamarrr
- Use `libcudf_exception_handler` throughout `pylibcudf.libcudf` (#17109) @brandon-b-miller
- Include timezone file path in error message (#17102) @bdice
- Migrate NVText Byte Pair Encoding APIs to pylibcudf (#17101) @Matt711
- Migrate NVText Tokenizing APIs to pylibcudf (#17100) @Matt711
- Migrate NVtext subword tokenizing APIs to pylibcudf (#17096) @Matt711
- Migrate NVText Stemming APIs to pylibcudf (#17085) @Matt711
- Migrate NVText Replacing APIs to pylibcudf (#17084) @Matt711
- Add IWYU to CI (#17078) @vyasr
- `cudf-polars` string/numeric casting (#17076) @brandon-b-miller
- Migrate NVText Normalizing APIs to Pylibcudf (#17072) @Matt711
- Migrate remaining nvtext NGrams APIs to pylibcudf (#17070) @Matt711
- Add profilers to CUDA 12 conda devcontainers (#17066) @vyasr
- Add conda recipe for cudf-polars (#17037) @bdice
- Implement batch construction for strings columns (#17035) @ttnghia
- Add device aggregators used by shared memory groupby (#17031) @PointKernel
- Add optional column_order in JSON reader (#17029) @karthikeyann
- Migrate Min Hashing APIs to pylibcudf (#17021) @Matt711
- Reorganize `cudf_polars` expression code (#17014) @brandon-b-miller
- Migrate nvtext jaccard API to pylibcudf (#17007) @Matt711
- Migrate nvtext generate_ngrams APIs to pylibcudf (#17006) @Matt711
- Control whether a file data source memory-maps the file with an environment variable (#17004) @vuule
- Switched BINARY_OP Benchmarks from GoogleBench to NVBench (#16963) @lamarrr
- [FEA] Report all unsupported operations for a query in cudf.polars (#16960) @Matt711
- [FEA] Migrate nvtext/edit_distance APIs to pylibcudf (#16957) @Matt711
- Switched AST benchmarks from GoogleBench to NVBench (#16952) @lamarrr
- Extend `device_scalar` to optionally use pinned bounce buffer (#16947) @vuule
- Implement `cudf-polars` chunked parquet reading (#16944) @brandon-b-miller
- Expose streams in public round APIs (#16925) @Matt711
- add telemetry setup to test (#16924) @msarahan
- Add cudf::strings::contains_multiple (#16900) @davidwendt
- Made cudftestutil header-only and removed GTest dependency (#16839) @lamarrr
- Add an example to demonstrate multithreaded `read_parquet` pipelines (#16828) @mhaseeb123
- Implement `extract_datetime_component` in `libcudf`/`pylibcudf` (#16776) @brandon-b-miller
- Add cudf::strings::find_re API (#16742) @davidwendt
- Migrate hashing operations to `pylibcudf` (#15418) @brandon-b-miller
🛠️ Improvements
- Simplify serialization protocols (#17552) @vyasr
- Add `pynvml` as a dependency for `dask-cudf` (#17386) @pentschev
- Enable unified memory by default in `cudf_polars` (#17375) @galipremsagar
- Support polars 1.14 (#17355) @wence-
- Remove cudf._lib.quantiles in favor of inlining pylibcudf (#17347) @mroeschke
- Remove cudf._lib.labeling in favor of inlining pylibcudf (#17346) @mroeschke
- Remove cudf._lib.hash in favor of inlining pylibcudf (#17345) @mroeschke
- Remove cudf._lib.concat in favor of inlining pylibcudf (#17344) @mroeschke
- Extract `GPUEngine` config options at translation time (#17339) @rjzamora
- Update java datetime APIs to match CUDF. (#17329) @revans2
- Move strings url_decode benchmarks to nvbench (#17328) @davidwendt
- Move strings translate benchmarks to nvbench (#17325) @davidwendt
- Writing compressed output using JSON writer (#17323) @shrshi
- Test the full matrix for polars and dask wheels on nightlies (#17320) @vyasr
- Remove cudf._lib.avro in favor of inlining pylibcudf (#17319) @mroeschke
- Move cudf._lib.unary to cudf.core._internals (#17318) @mroeschke
- prefer wheel-provided libcudf.so in load_library(), use RTLD_LOCAL (#17316) @jameslamb
- Clean up misc, unneeded pylibcudf.libcudf in cudf._lib (#17309) @mroeschke
- Exclude nanoarrow and flatbuffers from installation (#17308) @vyasr
- Update CI jobs to include Polars in nightlies and improve IWYU (#17306) @vyasr
- Move strings repeat benchmarks to nvbench (#17304) @davidwendt
- Fix synchronization bug in bool parquet mukernels (#17302) @pmattione-nvidia
- Move strings replace benchmarks to nvbench (#17301) @davidwendt
- Support polars 1.13 (#17299) @wence-
- Replace FindcuFile with upstream FindCUDAToolkit support (#17298) @KyleFromNVIDIA
- Expose stream-ordering in public transpose API (#17294) @shrshi
- Replace workaround of JNI build with CUDF_KVIKIO_REMOTE_IO=OFF (#17293) @pxLi
- cmake option: `CUDF_KVIKIO_REMOTE_IO` (#17291) @madsbk
- Use more pylibcudf Python enums in cudf._lib (#17288) @mroeschke
- Use pylibcudf enums in cudf Python quantile (#17287) @mroeschke
- enforce wheel size limits, README formatting in CI (#17284) @jameslamb
- Use numba-cuda<0.0.18 (#17280) @gmarkall
- Add compute_column_expression to pylibcudf for transform.compute_column (#17279) @mroeschke
- Optimize distinct inner join to use set `find` instead of `retrieve` (#17278) @PointKernel
- remove WheelHelpers.cmake (#17276) @jameslamb
- Plumb pylibcudf datetime APIs through cudf python (#17275) @Matt711
- Follow up making Python tests more deterministic (#17272) @mroeschke
- Use pylibcudf.search APIs in cudf python (#17271) @Matt711
- Use `pylibcudf.strings.convert.convert_integers.is_integer` in cudf python (#17270) @Matt711
- Move strings filter benchmarks to nvbench (#17269) @davidwendt
- Make constructor of DeviceMemoryBufferView public (#17265) @liurenjie1024
- Put a ceiling on cuda-python (#17264) @jameslamb
- Always prefer `device_read`s and `device_write`s when kvikIO is enabled (#17260) @vuule
- Expose streams in public quantile APIs (#17257) @shrshi
- Add support for `pyarrow-18` (#17256) @galipremsagar
- Move strings/numeric convert benchmarks to nvbench (#17255) @davidwendt
- Add new `dask_cudf.read_parquet` API (#17250) @rjzamora
- Add read_parquet_metadata to pylibcudf (#17245) @mroeschke
- Search for kvikio with lowercase (#17243) @vyasr
- KvikIO shared library (#17239) @madsbk
- Use more pylibcudf.io.types enums in cudf._libs (#17237) @mroeschke
- Expose mixed and conditional joins in pylibcudf (#17235) @wence-
- Add io.text APIs to pylibcudf (#17232) @mroeschke
- Add `num_iterations` axis to the multi-threaded Parquet benchmarks (#17231) @vuule
- Move strings to date/time types benchmarks to nvbench (#17229) @davidwendt
- Support for polars 1.12 in cudf-polars (#17227) @wence-
- Allow generating large strings in benchmarks (#17224) @davidwendt
- Refactor gather/scatter benchmarks for strings (#17223) @davidwendt
- Deprecate single component extraction methods in libcudf (#17221) @Matt711
- Remove `nvtext::load_vocabulary` from pylibcudf (#17220) @Matt711
- Benchmarking JSON reader for compressed inputs (#17219) @shrshi
- Expose stream-ordering in partitioning API (#17213) @shrshi
- Move strings::concatenate benchmark to nvbench (#17211) @davidwendt
- Expose stream-ordering in subword tokenizer API (#17206) @shrshi
- Refactor Dask cuDF legacy code (#17205) @rjzamora
- Make HostMemoryBuffer call into the DefaultHostMemoryAllocator (#17204) @revans2
- Unified binary_ops and ast benchmarks parameter names (#17200) @lamarrr
- Add in new java API for raw host memory allocation (#17197) @revans2
- Remove java reservation (#17189) @revans2
- Fixed unused attribute compilation error for GCC 13 (#17188) @lamarrr
- Change default KvikIO parameters in cuDF: set the thread pool size to 4, and compatibility mode to ON (#17185) @kingcrimsontianyu
- Use make_device_uvector instead of cudaMemcpyAsync in inplace_bitmask_binop (#17181) @davidwendt
- Make ai.rapids.cudf.HostMemoryBuffer#copyFromStream public. (#17179) @liurenjie1024
- Separate evaluation logic from `IR` objects in cudf-polars (#17175) @rjzamora
- Move nvtext ngrams benchmarks to nvbench (#17173) @davidwendt
- Remove includes suggested by include-what-you-use (#17170) @vyasr
- Reading multi-source compressed JSONL files (#17161) @shrshi
- Process parquet bools with microkernels (#17157) @pmattione-nvidia
- Upgrade to polars 1.11 in cudf-polars (#17154) @wence-
- Deprecate current libcudf nvtext minhash functions (#17152) @davidwendt
- Remove unused variable in internal merge_tdigests utility (#17151) @davidwendt
- Use the full ref name of `rmm.DeviceBuffer` in the sphinx config file (#17150) @Matt711
- Move `segmented_gather` function from the copying module to the lists module (#17148) @Matt711
- Use async execution policy for true_if (#17146) @PointKernel
- Add conversion from cudf-polars expressions to libcudf ast for parquet filters (#17141) @wence-
- devcontainer: replace `VAULT_HOST` with `AWS_ROLE_ARN` (#17134) @jjacobelli
- Replace direct `cudaMemcpyAsync` calls with utility functions (limited to `cudf::io`) (#17132) @vuule
- use rapids-generate-pip-constraints to pin to oldest dependencies in CI (#17131) @jameslamb
- Set the default number of threads in KvikIO thread pool to 8 (#17126) @kingcrimsontianyu
- Fix clang-tidy violations for span.hpp and hostdevice_vector.hpp (#17124) @davidwendt
- Disable the Parquet reader's wide lists tables GTest by default (#17120) @mhaseeb123
- Add compile time check to ensure the `counting_iterator` type in `counting_transform_iterator` fits in `size_type` (#17118) @mhaseeb123
- Minor I/O code quality improvements (#17105) @kingcrimsontianyu
- Remove the additional host register calls initially intended for performance improvement on Grace Hopper (#17092) @kingcrimsontianyu
- Split hash-based groupby into multiple smaller files to reduce build time (#17089) @PointKernel
- build wheels without build isolation (#17088) @jameslamb
- Polars: DataFrame Serialization (#17062) @madsbk
- Remove unused hash helper functions (#17056) @PointKernel
- Add to_dlpack/from_dlpack APIs to pylibcudf (#17055) @mroeschke
- Move `flatten_single_pass_aggs` to its own TU (#17053) @PointKernel
- Replace deprecated cuco APIs with updated versions (#17052) @PointKernel
- Refactor ORC dictionary encoding to migrate to the new `cuco::static_map` (#17049) @mhaseeb123
- Move pylibcudf/libcudf/wrappers/decimals to pylibcudf/libcudf/fixed_point (#17048) @mroeschke
- make conda installs in CI stricter (part 2) (#17042) @jameslamb
- Use managed memory for NDSH benchmarks (#17039) @karthikeyann
- Clean up hash-groupby `var_hash_functor` (#17034) @PointKernel
- Add json APIs to pylibcudf (#17025) @mroeschke
- Add string.replace_re APIs to pylibcudf (#17023) @mroeschke
- Replace old host tree algorithm with new algorithm in JSON reader (#17019) @karthikeyann
- Unify treatment of `Expr` and `IR` nodes in cudf-polars DSL (#17016) @wence-
- make conda installs in CI stricter (#17013) @jameslamb
- Pylibcudf: pack and unpack (#17012) @madsbk
- Remove unneeded pylibcudf.libcudf.wrappers.duration usage in cudf (#17010) @mroeschke
- Add custom "fused" groupby aggregation to Dask cuDF (#17009) @rjzamora
- Make tests more deterministic (#17008) @galipremsagar
- Remove unused import (#17005) @Matt711
- Add string.convert.convert_urls APIs to pylibcudf (#17003) @mroeschke
- Add release tracking to project automation scripts (#17001) @jarmak-nv
- Implement inequality joins by translation to conditional joins (#17000) @wence-
- Add string.convert.convert_lists APIs to pylibcudf (#16997) @mroeschke
- Performance optimization of JSON validation (#16996) @karthikeyann
- Add string.convert.convert_ipv4 APIs to pylibcudf (#16994) @mroeschke
- Add string.convert.convert_integers APIs to pylibcudf (#16991) @mroeschke
- Add string.convert_floats APIs to pylibcudf (#16990) @mroeschke
- Add string.convert.convert_fixed_type APIs to pylibcudf (#16984) @mroeschke
- Remove unnecessary `std::move`'s in pylibcudf (#16983) @Matt711
- Add docstrings and test for strings.convert_durations APIs for pylibcudf (#16982) @mroeschke
- JSON tokenizer memory optimizations (#16978) @shrshi
- Turn on `xfail_strict = true` for all python packages (#16977) @wence-
- Add string.convert.convert_datetime/convert_booleans APIs to pylibcudf (#16971) @mroeschke
- Auto assign PR to author (#16969) @Matt711
- Deprecate support for directly accessing logger (#16964) @vyasr
- Expunge NamedColumn (#16962) @wence-
- Add clang-tidy to CI (#16958) @vyasr
- Address all remaining clang-tidy errors (#16956) @vyasr
- Apply clang-tidy autofixes (#16949) @vyasr
- Use nvcomp wheel instead of bundling nvcomp (#16946) @KyleFromNVIDIA
- Refactor the `cuda_memcpy` functions to make them more usable (#16945) @vuule
- Add string.split APIs to pylibcudf (#16940) @mroeschke
- clang-tidy fixes part 3 (#16939) @vyasr
- clang-tidy fixes part 2 (#16938) @vyasr
- clang-tidy fixes part 1 (#16937) @vyasr
- Add string.wrap APIs to pylibcudf (#16935) @mroeschke
- Add string.translate APIs to pylibcudf (#16934) @mroeschke
- Add string.find_multiple APIs to pylibcudf (#16920) @mroeschke
- Batch memcpy the last offsets for output buffers of str and list cols in PQ reader (#16905) @mhaseeb123
- reduce wheel build verbosity, narrow deprecation warning filter (#16896) @jameslamb
- Improve aggregation device functors (#16884) @PointKernel
- Upgrade pandas pinnings to support `2.2.3` (#16882) @galipremsagar
- Fix 24.10 to 24.12 forward merge (#16876) @bdice
- Manually resolve conflicts in between branch-24.12 and branch-24.10 (#16871) @galipremsagar
- Add in support for setting delim when parsing JSON through java (#16867) @revans2
- Reapply `mixed_semi_join` refactoring and bug fixes (#16859) @mhaseeb123
- Add string padding and side_type APIs to pylibcudf (#16833) @mroeschke
- Organize parquet reader mukernel non-nullable code, introduce manual block scans (#16830) @pmattione-nvidia
- Remove superfluous use of std::vector for std::future (#16829) @kingcrimsontianyu
- Rework `read_csv` IO to avoid reading whole input with a single `host_read` (#16826) @vuule
- Add strings.combine APIs to pylibcudf (#16790) @mroeschke
- Add remaining string.char_types APIs to pylibcudf (#16788) @mroeschke
- Add new nvtext minhash_permuted API (#16756) @davidwendt
- Avoid public constructors when called with columns to avoid unnecessary validation (#16747) @mroeschke
- Use `changed-files` shared workflow (#16713) @KyleFromNVIDIA
- lint: replace `isort` with Ruff's rule I (#16685) @Borda
- Improve the performance of low cardinality groupby (#16619) @PointKernel
- Parquet reader list microkernel (#16538) @pmattione-nvidia
- AWS S3 IO through KvikIO (#16499) @madsbk
- Refactor `histogram` reduction using `cuco::static_set::insert_and_find` (#16485) @srinivasyadav18
- Use numba-cuda>=0.0.13 (#16474) @gmarkall
- C++
Published by GPUtester about 1 year ago
https://github.com/rapidsai/cudf - v24.10.01
This hotfix corrected some python packaging issues.
Full Changelog: https://github.com/rapidsai/cudf/compare/v24.10.00...v24.10.01
- C++
Published by raydouglass over 1 year ago
https://github.com/rapidsai/cudf - v24.10.00
🚨 Breaking Changes
- Whitespace normalization of nested column coerced as string column in JSONL inputs (#16759) @shrshi
- Add libcudf wrappers around current_device_resource functions. (#16679) @harrism
- Fix empty cluster handling in tdigest merge (#16675) @jihoonson
- Remove java ColumnView.copyWithBooleanColumnAsValidity (#16660) @revans2
- Support reading multiple PQ sources with mismatching nullability for columns (#16639) @mhaseeb123
- Remove arrow_io_source (#16607) @vyasr
- Remove legacy Arrow interop APIs (#16590) @vyasr
- Remove NativeFile support from cudf Python (#16589) @vyasr
- Revert "Make proxy NumPy arrays pass isinstance check in `cudf.pandas`" (#16586) @Matt711
- Align public utility function signatures with pandas 2.x (#16565) @mroeschke
- Disallow cudf.Index accepting column in favor of ._from_column (#16549) @mroeschke
- Refactor dictionary encoding in PQ writer to migrate to the new `cuco::static_map` (#16541) @mhaseeb123
- Change IPv4 convert APIs to support UINT32 instead of INT64 (#16489) @davidwendt
- enable list to be forced as string in JSON reader. (#16472) @karthikeyann
- Disallow cudf.Series to accept column in favor of `._from_column` (#16454) @mroeschke
- Align groupby APIs with pandas 2.x (#16403) @mroeschke
- Align misc DataFrame and MultiIndex methods with pandas 2.x (#16402) @mroeschke
- Align Index APIs with pandas 2.x (#16361) @mroeschke
- Add `stream` param to stream compaction APIs (#16295) @JayjeetAtGithub
🐛 Bug Fixes
- Add license to the pylibcudf wheel (#16976) @raydouglass
- Parse newline as whitespace character while tokenizing JSONL inputs with non-newline delimiter (#16950) @shrshi
- Add dask-cudf workaround for missing `rename_axis` support in cudf (#16899) @rjzamora
- Update oldest deps for `pyarrow` & `numpy` (#16883) @galipremsagar
- Update labeler for pylibcudf (#16868) @vyasr
- Revert "Refactor mixed_semi_join using cuco::static_set" (#16855) @mhaseeb123
- Fix metadata after implicit array conversion from Dask cuDF (#16842) @rjzamora
- Add cudf.pandas dependencies.yaml to update-version.sh (#16840) @raydouglass
- Use cupy 12.2.0 as oldest dependency pinning on CUDA 12 ARM (#16808) @bdice
- Revert "Fix empty cluster handling in tdigest merge (#16675)" (#16800) @jihoonson
- Intentionally leak thread_local CUDA resources to avoid crash (part 1) (#16787) @kingcrimsontianyu
- Fix `cov`/`corr` bug in dask-cudf (#16786) @rjzamora
- Fix slice_strings wide strings logic with multi-byte characters (#16777) @davidwendt
- Fix nvbench output for sha512 (#16773) @davidwendt
- Allow read_csv(header=None) to return int column labels in `mode.pandas_compatible` (#16769) @mroeschke
- Whitespace normalization of nested column coerced as string column in JSONL inputs (#16759) @shrshi
- Fix DataFrame.drop(columns=cudf.Series/Index, axis=1) (#16712) @mroeschke
- Use merge base when calculating changed files (#16709) @KyleFromNVIDIA
- Ensure we pass the has_nulls tparam to mixed_join kernels (#16708) @abellina
- Add boost-devel to Java CI Docker image (#16707) @jlowe
- [BUG] Add gpu node type to cudf-pandas 3rd-party integration nightly CI job (#16704) @Matt711
- Fix typo in column_factories.hpp comment from 'depth 1' to 'depth 2' (#16700) @a-hirota
- Fix Series.to_frame(name=None) setting a None name (#16698) @mroeschke
- Disable gtests/ERROR_TEST during compute-sanitizer memcheck test (#16691) @davidwendt
- Enable batched multi-source reading of JSONL files with large records (#16687) @shrshi
- Handle `ordered` parameter in `CategoricalIndex.__repr__` (#16683) @galipremsagar
- Fix loc/iloc.__setitem__[:, loc] with non cupy types (#16677) @mroeschke
- Fix empty cluster handling in tdigest merge (#16675) @jihoonson
- Fix `cudf::rank` not getting enough params (#16666) @JayjeetAtGithub
- Fix slowdown in `CategoricalIndex.__repr__` (#16665) @galipremsagar
- Remove java ColumnView.copyWithBooleanColumnAsValidity (#16660) @revans2
- Fix slowdown in DataFrame repr in jupyter notebook (#16656) @galipremsagar
- Preserve Series name in duplicated method. (#16655) @bdice
- Fix interval_range right child non-zero offset (#16651) @mroeschke
- fix libcudf wheel publishing, make package-type explicit in wheel publishing (#16650) @jameslamb
- Revert "Hide all gtest symbols in cudftestutil (#16546)" (#16644) @robertmaynard
- Fix integer overflow in indexalator pointer logic (#16643) @davidwendt
- Allow for binops between two differently sized DecimalDtypes (#16638) @mroeschke
- Move pragma once in rolling/jit/operation.hpp. (#16636) @bdice
- Fix overflow bug in low-memory JSON reader (#16632) @shrshi
- Add the missing `num_aggregations` axis for `groupby_max_cardinality` (#16630) @PointKernel
- Fix strings::detail::copy_range when target contains nulls (#16626) @davidwendt
- Fix function parameters with common dependency modified during their evaluation (#16620) @ttnghia
- bug-fix: Don't enable the CUDA language if testing was requested when finding cudf (#16615) @cryos
- bug-fix: cudf/io/json.hpp use after move (#16609) @NicolasDenoyelle
- Remove CUDA whole compilation ODR violations (#16603) @robertmaynard
- MAINT: Adapt to numpy hiding flagsobject away (#16593) @seberg
- Revert "Make proxy NumPy arrays pass isinstance check in `cudf.pandas`" (#16586) @Matt711
- Switch python version to `3.10` in `cudf.pandas` pandas test scripts (#16559) @galipremsagar
- Hide all gtest symbols in cudftestutil (#16546) @robertmaynard
- Update the java code to properly deal with lists being returned as strings (#16536) @revans2
- Register `read_parquet` and `read_csv` with dask-expr (#16535) @rjzamora
- Change cudf::empty_like to not include offsets for empty strings columns (#16529) @davidwendt
- Fix DataFrame reductions with median returning scalar instead of Series (#16527) @mroeschke
- Allow DataFrame.sort_values(by=) to select an index level (#16519) @mroeschke
- Fix `date_range(start, end, freq)` when end-start is divisible by freq (#16516) @mroeschke
- Preserve array name in MultiIndex.from_arrays (#16515) @mroeschke
- Disallow indexing by selecting duplicate labels (#16514) @mroeschke
- Fix `.replace(Index, Index)` raising a TypeError (#16513) @mroeschke
- Check index bounds in compact protocol reader. (#16493) @bdice
- Fix build failures with GCC 13 (#16488) @PointKernel
- Fix all-empty input column for strings split APIs (#16466) @davidwendt
- Fix segmented-sort overlapped input/output indices (#16463) @davidwendt
- Fix merge conflict for auto merge 16447 (#16449) @davidwendt
📖 Documentation
- Fix links in Dask cuDF documentation (#16929) @rjzamora
- Improve aggregation documentation (#16822) @PointKernel
- Add best practices page to Dask cuDF docs (#16821) @rjzamora
- [DOC] Update Pylibcudf doc strings (#16810) @Matt711
- Recommending `miniforge` for conda install (#16782) @mmccarty
- Add labeling pylibcudf doc pages (#16779) @mroeschke
- Migrate dask-cudf README improvements to dask-cudf sphinx docs (#16765) @rjzamora
- [DOC] Remove out of date section from cudf.pandas docs (#16697) @Matt711
- Add performance tips to cudf.pandas FAQ. (#16693) @bdice
- Update documentation for Dask cuDF (#16671) @rjzamora
- Add missing pylibcudf strings docs (#16471) @brandon-b-miller
- DOC: Refresh pylibcudf guide (#15856) @lithomas1
🚀 New Features
- Build `cudf-polars` with `build.sh` (#16898) @brandon-b-miller
- Add polars to "all" dependency list. (#16875) @bdice
- nvCOMP GZIP integration (#16770) @vuule
- [FEA] Add support for `cudf.NamedAgg` (#16744) @Matt711
- Add experimental `filesystem="arrow"` support in `dask_cudf.read_parquet` (#16684) @rjzamora
- Relax Arrow pin (#16681) @vyasr
- Add libcudf wrappers around current_device_resource functions. (#16679) @harrism
- Move NDS-H examples into benchmarks (#16663) @JayjeetAtGithub
- [FEA] Add third-party library integration testing of cudf.pandas to cudf (#16645) @Matt711
- Make isinstance check pass for proxy ndarrays (#16601) @Matt711
- [FEA] Add an environment variable to fail on fallback in `cudf.pandas` (#16562) @Matt711
- [FEA] Add support for `cudf.unique` (#16554) @Matt711
- [FEA] Support named aggregations in `df.groupby().agg()` (#16528) @Matt711
- Change IPv4 convert APIs to support UINT32 instead of INT64 (#16489) @davidwendt
- enable list to be forced as string in JSON reader. (#16472) @karthikeyann
- Remove cuDF dependency from pylibcudf column from_device tests (#16441) @brandon-b-miller
- Enable cudf.pandas REPL and -c command support (#16428) @bdice
- Setup pylibcudf package (#16299) @lithomas1
- Add a libcudf/thrust-based TPC-H derived datagen (#16294) @JayjeetAtGithub
- Make proxy NumPy arrays pass isinstance check in `cudf.pandas` (#16286) @Matt711
- Add skiprows and nrows to parquet reader (#16214) @lithomas1
- Upgrade to nvcomp 4.0.1 (#16076) @vuule
- Migrate ORC reader to pylibcudf (#16042) @lithomas1
- JSON reader validation of values (#15968) @karthikeyann
- Implement exposed null mask APIs in pylibcudf (#15908) @charlesbluca
- Word-based nvtext::minhash function (#15368) @davidwendt
🛠️ Improvements
- Make tests deterministic (#16910) @galipremsagar
- Update update-version.sh to use packaging lib (#16891) @AyodeAwe
- Pin polars for 24.10 and update polars test suite xfail list (#16886) @wence-
- Add in support for setting delim when parsing JSON through java (#16867) (#16880) @revans2
- Remove unnecessary flag from build.sh (#16879) @vyasr
- Ignore numba warning specific to ARM runners (#16872) @galipremsagar
- Display deltas for `cudf.pandas` test summary (#16864) @galipremsagar
- Switch to using native `traceback` (#16851) @galipremsagar
- JSON tree algorithm code reorg (#16836) @karthikeyann
- Add string.repeats API to pylibcudf (#16834) @mroeschke
- Use CI workflow branch 'branch-24.10' again (#16832) @jameslamb
- Rename the NDS-H benchmark binaries (#16831) @JayjeetAtGithub
- Add string.findall APIs to pylibcudf (#16825) @mroeschke
- Add string.extract APIs to pylibcudf (#16823) @mroeschke
- use get-pr-info from nv-gha-runners (#16819) @AyodeAwe
- Add string.contains APIs to pylibcudf (#16814) @mroeschke
- Forward-merge branch-24.08 to branch-24.10 (#16813) @bdice
- Add io_type axis with default `PINNED_BUFFER` to nvbench PQ multithreaded reader (#16809) @mhaseeb123
- Update fmt (to 11.0.2) and spdlog (to 1.14.1). (#16806) @jameslamb
- Add ability to set parquet row group max #rows and #bytes in java (#16805) @pmattione-nvidia
- Add in option for Java JSON APIs to do column pruning in CUDF (#16796) @revans2
- Support drop_first in get_dummies (#16795) @mroeschke
- Exposed stream-ordering to join API (#16793) @lamarrr
- Add string.attributes APIs to pylibcudf (#16785) @mroeschke
- Java: Make ColumnVector.fromViewWithContiguousAllocation public (#16784) @jlowe
- Add partitioning APIs to pylibcudf (#16781) @mroeschke
- Optimization of tdigest merge aggregation. (#16780) @nvdbaranec
- use libkvikio wheels in wheel builds (#16778) @jameslamb
- Exposed stream-ordering to datetime API (#16774) @lamarrr
- Add io/timezone APIs to pylibcudf (#16771) @mroeschke
- Remove `MultiIndex._poplevel` inplace implementation. (#16767) @mroeschke
- allow pandas patch version to float in cudf-pandas unit tests (#16763) @jameslamb
- Simplify the nvCOMP adapter (#16762) @vuule
- Add labeling APIs to pylibcudf (#16761) @mroeschke
- Add transform APIs to pylibcudf (#16760) @mroeschke
- Add a benchmark to study Parquet reader's performance for wide tables (#16751) @mhaseeb123
- Change the Parquet writer's `default_row_group_size_bytes` from 128MB to inf (#16750) @mhaseeb123
- Add transpose API to pylibcudf (#16749) @mroeschke
- Add support for Python 3.12, update Kafka dependencies to 2.5.x (#16745) @jameslamb
- Generate GPU vs CPU usage metrics per pytest file in pandas testsuite for `cudf.pandas` (#16739) @galipremsagar
- Refactor cudf pandas integration tests CI (#16728) @Matt711
- Remove ERROR_TEST gtest from libcudf (#16722) @davidwendt
- Use Series._from_column more consistently to avoid validation (#16716) @mroeschke
- remove some unnecessary libcudf nightly builds (#16714) @jameslamb
- Remove xfail from torch-cudf.pandas integration test (#16705) @Matt711
- Add return type annotations to MultiIndex (#16696) @mroeschke
- Add type annotations to Index classes, utilize _from_column more (#16695) @mroeschke
- Have interval_range use IntervalIndex.from_breaks, remove column_empty_same_mask (#16694) @mroeschke
- Increase timeouts for couple of tests (#16692) @galipremsagar
- Replace raw device_memory_resource pointer in pylibcudf Cython (#16674) @harrism
- switch from typing.Callable to collections.abc.Callable (#16670) @jameslamb
- Update rapidsai/pre-commit-hooks (#16669) @KyleFromNVIDIA
- Multi-file and Parquet-aware prefetching from remote storage (#16657) @rjzamora
- Access Frame attributes instead of ColumnAccessor attributes when available (#16652) @mroeschke
- Use non-mangled type names in nvbench output (#16649) @davidwendt
- Add pylibcudf build dir in build.sh for `clean` (#16648) @galipremsagar
- Prune workflows based on changed files (#16642) @KyleFromNVIDIA
- Remove arrow dependency (#16640) @vyasr
- Support reading multiple PQ sources with mismatching nullability for columns (#16639) @mhaseeb123
- Drop Python 3.9 support (#16637) @jameslamb
- Support DecimalDtype meta in dask_cudf (#16634) @mroeschke
- Add `num_multiprocessors` utility (#16628) @PointKernel
- Annotate `ColumnAccessor._data` labels as `Hashable` (#16623) @mroeschke
- Remove build_categorical_column in favor of CategoricalColumn constructor (#16617) @mroeschke
- Move apply_boolean_mask benchmark to nvbench (#16616) @davidwendt
- Revise `get_reader_filepath_or_buffer` to handle a list of data sources (#16613) @rjzamora
- do not install cudf in cudf_polars wheel tests (#16612) @jameslamb
- remove streamz git dependency, standardize build dependency names, consolidate some dependency lists (#16611) @jameslamb
- Fix C++ and Cython io types (#16610) @vyasr
- Remove arrow_io_source (#16607) @vyasr
- Remove thrust::optional from expression evaluator (#16604) @bdice
- Add stricter typing and validation to ColumnAccessor (#16602) @mroeschke
- make more use of YAML anchors in dependencies.yaml (#16597) @jameslamb
- Enable testing `cudf.pandas` unit tests for all minor versions of pandas (#16595) @galipremsagar
- Extend the Parquet writer's dictionary encoding benchmark. (#16591) @mhaseeb123
- Remove legacy Arrow interop APIs (#16590) @vyasr
- Remove NativeFile support from cudf Python (#16589) @vyasr
- Add build job for pylibcudf (#16587) @vyasr
- Add `public` qualifier for some member functions in Java class `Schema` (#16583) @ttnghia
- Enable gtests previously disabled for compute-sanitizer bug (#16581) @davidwendt
- [FEA] Add filesystem argument to `cudf.read_parquet` (#16577) @rjzamora
- Ensure size is always passed to NumericalColumn (#16576) @mroeschke
- standardize and consolidate wheel installations in testing scripts (#16575) @jameslamb
- Performance improvement for strings::slice for wide strings (#16574) @davidwendt
- Add `ToCudfBackend` expression to dask-cudf (#16573) @rjzamora
- CI: Test against old versions of key dependencies (#16570) @seberg
- Replace `NativeFile` dependency in dask-cudf Parquet reader (#16569) @rjzamora
- Align public utility function signatures with pandas 2.x (#16565) @mroeschke
- Move libcudf reduction google-benchmarks to nvbench (#16564) @davidwendt
- Rework strings::slice benchmark to use nvbench (#16563) @davidwendt
- Reenable arrow tests (#16556) @vyasr
- Clean up reshaping ops (#16553) @mroeschke
- Disallow cudf.Index accepting column in favor of ._from_column (#16549) @mroeschke
- Rewrite remaining Python Arrow interop conversions using the C Data Interface (#16548) @vyasr
- [REVIEW] JSON host tree algorithms (#16545) @shrshi
- Refactor dictionary encoding in PQ writer to migrate to the new `cuco::static_map` (#16541) @mhaseeb123
- Remove hardcoded versions from workflows. (#16540) @bdice
- Ensure comparisons with pyints and integer series always succeed (#16532) @seberg
- Remove unneeded output size parameter from internal count_matches utility (#16531) @davidwendt
- Remove invalid column_view usage in string-scalar-to-column function (#16530) @davidwendt
- Raise NotImplementedError for Series.rename that's not a scalar (#16525) @mroeschke
- Remove deprecated public APIs from libcudf (#16524) @davidwendt
- Return Interval object in pandas compat mode for IntervalIndex reductions (#16523) @mroeschke
- Update json normalization to take device_buffer (#16520) @karthikeyann
- Rework cudf::io::text::byte_range_info class member functions (#16518) @davidwendt
- Remove unneeded pair-iterator benchmark (#16511) @davidwendt
- Update pre-commit hooks (#16510) @KyleFromNVIDIA
- Improve update-version.sh (#16506) @bdice
- Use tool.scikit-build.cmake.version, set scikit-build-core minimum-version (#16503) @jameslamb
- Pass batch size to JSON reader using environment variable (#16502) @shrshi
- Remove a deprecated multibyte_split API (#16501) @davidwendt
- Add interop example for `arrow::StringViewArray` to `cudf::column` (#16498) @JayjeetAtGithub
- Add keep option to distinct nvbench (#16497) @bdice
- Use more idiomatic cudf APIs in dask_cudf meta generation (#16487) @mroeschke
- Fix typo in dispatch_row_equal. (#16473) @bdice
- Use explicit construction of column subclass instead of `build_column` when type is known (#16470) @mroeschke
- Move exception handler into pylibcudf from cudf (#16468) @lithomas1
- Make StructColumn.__init__ strict (#16467) @mroeschke
- Make ListColumn.__init__ strict (#16465) @mroeschke
- Make Timedelta/DatetimeColumn.__init__ strict (#16464) @mroeschke
- Make NumericalColumn.__init__ strict (#16457) @mroeschke
- Make CategoricalColumn.__init__ strict (#16456) @mroeschke
- Disallow cudf.Series to accept column in favor of `._from_column` (#16454) @mroeschke
- Expose `stream` param in transform APIs (#16452) @JayjeetAtGithub
- Add upper bound pin for polars (#16442) @wence-
- Make (Indexed)Frame.__init__ require data (and index) (#16430) @mroeschke
- Add Java APIs to copy column data to host asynchronously (#16429) @jlowe
- Update docs of the TPC-H derived examples (#16423) @JayjeetAtGithub
- Use RMM adaptor constructors instead of factories. (#16414) @bdice
- Align ewm APIs with pandas 2.x (#16413) @mroeschke
- Remove checking for specific tests in memcheck script (#16412) @davidwendt
- Add stream parameter to reshape APIs (#16410) @davidwendt
- Align groupby APIs with pandas 2.x (#16403) @mroeschke
- Align misc DataFrame and MultiIndex methods with pandas 2.x (#16402) @mroeschke
- update some branch references in GitHub Actions configs (#16397) @jameslamb
- Support reading matching projected and filter cols from Parquet files with otherwise mismatched schemas (#16394) @mhaseeb123
- Merge branch-24.08 into branch-24.10 (#16393) @jameslamb
- Add query 10 to the TPC-H suite (#16392) @JayjeetAtGithub
- Use `make_host_vector` instead of `make_std_vector` to facilitate pinned memory optimizations (#16386) @vuule
- Fix some issues with deprecated / removed cccl facilities (#16377) @miscco
- Align IntervalIndex APIs with pandas 2.x (#16371) @mroeschke
- Align CategoricalIndex APIs with pandas 2.x (#16369) @mroeschke
- Align TimedeltaIndex APIs with pandas 2.x (#16368) @mroeschke
- Align DatetimeIndex APIs with pandas 2.x (#16367) @mroeschke
- fix [tool.setuptools] reference in custreamz config (#16365) @jameslamb
- Align Index APIs with pandas 2.x (#16361) @mroeschke
- Rebuild for & Support NumPy 2 (#16300) @jakirkham
- Add `stream` param to stream compaction APIs (#16295) @JayjeetAtGithub
- Added batch memset to memset data and validity buffers in parquet reader (#16281) @sdrp713
- Deduplicate decimal32/decimal64 to decimal128 conversion function (#16236) @mhaseeb123
- Refactor mixed_semi_join using cuco::static_set (#16230) @srinivasyadav18
- Improve performance of hash_character_ngrams using warp-per-string kernel (#16212) @davidwendt
- Add environment variable to log cudf.pandas fallback calls (#16161) @mroeschke
- Add libcudf example with large strings (#15983) @davidwendt
- JSON tree algorithms refactor I: CSR data structure for column tree (#15979) @shrshi
- Support multiple new-line characters in regex APIs (#15961) @davidwendt
- adding wheel build for libcudf (#15483) @msarahan
- Replace usages of `thrust::optional` with `std::optional` (#15091) @miscco
- C++
Published by raydouglass over 1 year ago
https://github.com/rapidsai/cudf - [NIGHTLY] v24.12.00
🔗 Links
🚨 Breaking Changes
- Move detail header floating_conversion.hpp to detail subdirectory (#17209) @davidwendt
- Refactor Dask cuDF legacy code (#17205) @rjzamora
- Make HostMemoryBuffer call into the DefaultHostMemoryAllocator (#17204) @revans2
- Remove java reservation (#17189) @revans2
- Upgrade to polars 1.11 in cudf-polars (#17154) @wence-
- Remove the additional host register calls initially intended for performance improvement on Grace Hopper (#17092) @kingcrimsontianyu
- Correctly set `is_device_accessible` when creating `host_span`s from other container/span types (#17079) @vuule
- Unify treatment of `Expr` and `IR` nodes in cudf-polars DSL (#17016) @wence-
- Deprecate support for directly accessing logger (#16964) @vyasr
- Made cudftestutil header-only and removed GTest dependency (#16839) @lamarrr
🐛 Bug Fixes
- Fix binop with LHS numpy datetimelike scalar (#17226) @mroeschke
- Fix groupby.get_group with length-1 tuple with list-like grouper (#17216) @mroeschke
- Fix discoverability of submodules inside `pd.util` (#17215) @galipremsagar
- Fix `Schema.Builder` does not propagate precision value to `Builder` instance (#17214) @ttnghia
- [BUG] Replace `repo_token` with `github_token` in Auto Assign PR GHA (#17203) @Matt711
- Remove unsanitized nulls from input strings columns in reduction gtests (#17202) @davidwendt
- Fix `to_parquet` append behavior with global metadata file (#17198) @rjzamora
- Check `num_children() == 0` in `Column.from_column_view` (#17193) @cwharris
- Fix host-to-device copy missing sync in strings/duration convert (#17149) @davidwendt
- Add JNI Support for Multi-line Delimiters and Include Test (#17139) @SurajAralihalli
- Ignore loud dask warnings about legacy dataframe implementation (#17137) @galipremsagar
- Fix the GDS read/write segfault/bus error when the cuFile policy is set to GDS or ALWAYS (#17122) @kingcrimsontianyu
- Fix `DataFrame._from_arrays` and introduce validations (#17112) @galipremsagar
- [Bug] Fix Arrow-FS parquet reader for larger files (#17099) @rjzamora
- Fix bug in recovering invalid lines in JSONL inputs (#17098) @shrshi
- Reenable huge pages for arrow host copying (#17097) @vyasr
- Correctly set `is_device_accessible` when creating `host_span`s from other container/span types (#17079) @vuule
- Fix ORC reader when using `device_read_async` while the destination device buffers are not ready (#17074) @ttnghia
- Fix regex handling of fixed quantifier with 0 range (#17067) @davidwendt
- Limit the number of keys to calculate column sizes and page starts in PQ reader to 1B (#17059) @mhaseeb123
- Adding assertion to check for regular JSON inputs of size greater than `INT_MAX` bytes (#17057) @shrshi
- bug fix: use `self.ck_consumer` in `poll` method of kafka.py to align with `__init__` (#17044) @a-hirota
- Disable kvikio remote I/O to avoid openssl dependencies in JNI build (#17026) @pxLi
- Fix `host_span` constructor to correctly copy `is_device_accessible` (#17020) @vuule
- Add pinning for pyarrow in wheels (#17018) @vyasr
- Use std::optional for host types (#17015) @robertmaynard
- Fix write_json to handle empty string column (#16995) @karthikeyann
- Restore export of nvcomp outside of wheel builds (#16988) @KyleFromNVIDIA
- Allow melt(var_name=) to be a falsy label (#16981) @mroeschke
- Fix astype from tz-aware type to tz-aware type (#16980) @mroeschke
- Use `libcudf` wheel from PR rather than nightly for `polars-polars` CI test job (#16975) @brandon-b-miller
- Fix order-preservation in pandas-compat unsorted groupby (#16942) @wence-
- Fix cudf::strings::findall error with empty input (#16928) @davidwendt
- Fix JsonLargeReaderTest.MultiBatch use of LIBCUDF_JSON_BATCH_SIZE env var (#16927) @davidwendt
- Parse newline as whitespace character while tokenizing JSONL inputs with non-newline delimiter (#16923) @shrshi
- Respect groupby.nunique(dropna=False) (#16921) @mroeschke
- Update all rmm imports to use pylibrmm/librmm (#16913) @Matt711
- Fix order-preservation in cudf-polars groupby (#16907) @wence-
- Add a shortcut for when the input clusters are all empty for the tdigest merge (#16897) @jihoonson
- Properly handle the mapped and registered regions in `memory_mapped_source` (#16865) @vuule
- Fix performance regression for generate_character_ngrams (#16849) @davidwendt
- Fix regex parsing logic handling of nested quantifiers (#16798) @davidwendt
- Compute whole column variance using numerically stable approach (#16448) @wence-
📖 Documentation
- Fix some documentation rendering for pylibcudf (#17217) @mroeschke
- Move detail header floating_conversion.hpp to detail subdirectory (#17209) @davidwendt
- Add TokenizeVocabulary to api docs (#17208) @davidwendt
- Add jaccard_index to generated cuDF docs (#17199) @davidwendt
- [no ci] Add empty-columns section to the libcudf developer guide (#17183) @davidwendt
- Add 2-cpp approvers text to contributing guide no ci @davidwendt
- Changing developer guide int_64_t to int64_t (#17130) @hyperbolic2346
- docs: change 'CSV' to 'csv' in python/custreamz/README.md to match kafka.py (#17041) @a-hirota
- [DOC] Document limitation using `cudf.pandas` proxy arrays (#16955) @Matt711
- [DOC] Document environment variable for failing on fallback in `cudf.pandas` (#16932) @Matt711
🚀 New Features
- Upgrade nvcomp to 4.1.0.6 (#17201) @bdice
- Support storing `precision` of decimal types in `Schema` class (#17176) @ttnghia
- Add compute_shared_memory_aggs used by shared memory groupby (#17162) @PointKernel
- Add compute_mapping_indices used by shared memory groupby (#17147) @PointKernel
- Add remaining datetime APIs to pylibcudf (#17143) @Matt711
- Added strings AST vs BINARY_OP benchmarks (#17128) @lamarrr
- Include timezone file path in error message (#17102) @bdice
- Migrate NVText Byte Pair Encoding APIs to pylibcudf (#17101) @Matt711
- Migrate NVText Tokenizing APIs to pylibcudf (#17100) @Matt711
- Migrate NVtext subword tokenizing APIs to pylibcudf (#17096) @Matt711
- Migrate NVText Stemming APIs to pylibcudf (#17085) @Matt711
- Migrate NVText Replacing APIs to pylibcudf (#17084) @Matt711
- Migrate NVText Normalizing APIs to Pylibcudf (#17072) @Matt711
- Migrate remaining nvtext NGrams APIs to pylibcudf (#17070) @Matt711
- Add profilers to CUDA 12 conda devcontainers (#17066) @vyasr
- Add conda recipe for cudf-polars (#17037) @bdice
- Implement batch construction for strings columns (#17035) @ttnghia
- Add device aggregators used by shared memory groupby (#17031) @PointKernel
- Migrate Min Hashing APIs to pylibcudf (#17021) @Matt711
- Reorganize `cudf_polars` expression code (#17014) @brandon-b-miller
- Migrate nvtext jaccard API to pylibcudf (#17007) @Matt711
- Migrate nvtext generate_ngrams APIs to pylibcudf (#17006) @Matt711
- Control whether a file data source memory-maps the file with an environment variable (#17004) @vuule
- Switched BINARY_OP Benchmarks from GoogleBench to NVBench (#16963) @lamarrr
- [FEA] Migrate nvtext/edit_distance APIs to pylibcudf (#16957) @Matt711
- Switched AST benchmarks from GoogleBench to NVBench (#16952) @lamarrr
- Extend `device_scalar` to optionally use pinned bounce buffer (#16947) @vuule
- Expose streams in public round APIs (#16925) @Matt711
- Made cudftestutil header-only and removed GTest dependency (#16839) @lamarrr
- Add an example to demonstrate multithreaded `read_parquet` pipelines (#16828) @mhaseeb123
- Implement `extract_datetime_component` in `libcudf`/`pylibcudf` (#16776) @brandon-b-miller
- Add cudf::strings::find_re API (#16742) @davidwendt
- Migrate hashing operations to `pylibcudf` (#15418) @brandon-b-miller
🛠️ Improvements
- Use more pylibcudf.io.types enums in cudf._libs (#17237) @mroeschke
- Expose mixed and conditional joins in pylibcudf (#17235) @wence-
- Add `num_iterations` axis to the multi-threaded Parquet benchmarks (#17231) @vuule
- Support for polars 1.12 in cudf-polars (#17227) @wence-
- Remove `nvtext::load_vocabulary` from pylibcudf (#17220) @Matt711
- Expose stream-ordering in partitioning API (#17213) @shrshi
- Move strings::concatenate benchmark to nvbench (#17211) @davidwendt
- Expose stream-ordering in subword tokenizer API (#17206) @shrshi
- Refactor Dask cuDF legacy code (#17205) @rjzamora
- Make HostMemoryBuffer call into the DefaultHostMemoryAllocator (#17204) @revans2
- Unified binary_ops and ast benchmarks parameter names (#17200) @lamarrr
- Add in new java API for raw host memory allocation (#17197) @revans2
- Remove java reservation (#17189) @revans2
- Fixed unused attribute compilation error for GCC 13 (#17188) @lamarrr
- Change default KvikIO parameters in cuDF: set the thread pool size to 4, and compatibility mode to ON (#17185) @kingcrimsontianyu
- Use make_device_uvector instead of cudaMemcpyAsync in inplace_bitmask_binop (#17181) @davidwendt
- Make ai.rapids.cudf.HostMemoryBuffer#copyFromStream public. (#17179) @liurenjie1024
- Move nvtext ngrams benchmarks to nvbench (#17173) @davidwendt
- Remove includes suggested by include-what-you-use (#17170) @vyasr
- Upgrade to polars 1.11 in cudf-polars (#17154) @wence-
- Deprecate current libcudf nvtext minhash functions (#17152) @davidwendt
- Remove unused variable in internal merge_tdigests utility (#17151) @davidwendt
- Use the full ref name of `rmm.DeviceBuffer` in the sphinx config file (#17150) @Matt711
- Move `segmented_gather` function from the copying module to the lists module (#17148) @Matt711
- Use async execution policy for true_if (#17146) @PointKernel
- Add conversion from cudf-polars expressions to libcudf ast for parquet filters (#17141) @wence-
- devcontainer: replace `VAULT_HOST` with `AWS_ROLE_ARN` (#17134) @jjacobelli
- Replace direct `cudaMemcpyAsync` calls with utility functions (limited to `cudf::io`) (#17132) @vuule
- use rapids-generate-pip-constraints to pin to oldest dependencies in CI (#17131) @jameslamb
- Set the default number of threads in KvikIO thread pool to 8 (#17126) @kingcrimsontianyu
- Fix clang-tidy violations for span.hpp and hostdevice_vector.hpp (#17124) @davidwendt
- Disable the Parquet reader's wide lists tables GTest by default (#17120) @mhaseeb123
- Add compile time check to ensure the `counting_iterator` type in `counting_transform_iterator` fits in `size_type` (#17118) @mhaseeb123
- Minor I/O code quality improvements (#17105) @kingcrimsontianyu
- Remove the additional host register calls initially intended for performance improvement on Grace Hopper (#17092) @kingcrimsontianyu
- Split hash-based groupby into multiple smaller files to reduce build time (#17089) @PointKernel
- build wheels without build isolation (#17088) @jameslamb
- Remove unused hash helper functions (#17056) @PointKernel
- Add to_dlpack/from_dlpack APIs to pylibcudf (#17055) @mroeschke
- Move `flatten_single_pass_aggs` to its own TU (#17053) @PointKernel
- Replace deprecated cuco APIs with updated versions (#17052) @PointKernel
- Refactor ORC dictionary encoding to migrate to the new `cuco::static_map` (#17049) @mhaseeb123
- Move pylibcudf/libcudf/wrappers/decimals to pylibcudf/libcudf/fixed_point (#17048) @mroeschke
- make conda installs in CI stricter (part 2) (#17042) @jameslamb
- Use managed memory for NDSH benchmarks (#17039) @karthikeyann
- Clean up hash-groupby `var_hash_functor` (#17034) @PointKernel
- Add json APIs to pylibcudf (#17025) @mroeschke
- Add string.replace_re APIs to pylibcudf (#17023) @mroeschke
- Replace old host tree algorithm with new algorithm in JSON reader (#17019) @karthikeyann
- Unify treatment of `Expr` and `IR` nodes in cudf-polars DSL (#17016) @wence-
- make conda installs in CI stricter (#17013) @jameslamb
- Pylibcudf: pack and unpack (#17012) @madsbk
- Remove unneeded pylibcudf.libcudf.wrappers.duration usage in cudf (#17010) @mroeschke
- Add custom "fused" groupby aggregation to Dask cuDF (#17009) @rjzamora
- Make tests more deterministic (#17008) @galipremsagar
- Remove unused import (#17005) @Matt711
- Add string.convert.convert_urls APIs to pylibcudf (#17003) @mroeschke
- Add release tracking to project automation scripts (#17001) @jarmak-nv
- Add string.convert.convert_lists APIs to pylibcudf (#16997) @mroeschke
- Performance optimization of JSON validation (#16996) @karthikeyann
- Add string.convert.convert_ipv4 APIs to pylibcudf (#16994) @mroeschke
- Add string.convert.convert_integers APIs to pylibcudf (#16991) @mroeschke
- Add string.convert_floats APIs to pylibcudf (#16990) @mroeschke
- Add string.convert.convert_fixed_type APIs to pylibcudf (#16984) @mroeschke
- Remove unnecessary `std::move`'s in pylibcudf (#16983) @Matt711
- Add docstrings and test for strings.convert_durations APIs for pylibcudf (#16982) @mroeschke
- JSON tokenizer memory optimizations (#16978) @shrshi
- Turn on `xfail_strict = true` for all python packages (#16977) @wence-
- Add string.convert.convert_datetime/convert_booleans APIs to pylibcudf (#16971) @mroeschke
- Auto assign PR to author (#16969) @Matt711
- Deprecate support for directly accessing logger (#16964) @vyasr
- Expunge NamedColumn (#16962) @wence-
- Add clang-tidy to CI (#16958) @vyasr
- Address all remaining clang-tidy errors (#16956) @vyasr
- Apply clang-tidy autofixes (#16949) @vyasr
- Use nvcomp wheel instead of bundling nvcomp (#16946) @KyleFromNVIDIA
- Refactor the
cuda_memcpyfunctions to make them more usable (#16945) @vuule - Add string.split APIs to pylibcudf (#16940) @mroeschke
- clang-tidy fixes part 3 (#16939) @vyasr
- clang-tidy fixes part 2 (#16938) @vyasr
- clang-tidy fixes part 1 (#16937) @vyasr
- Add string.wrap APIs to pylibcudf (#16935) @mroeschke
- Add string.translate APIs to pylibcudf (#16934) @mroeschke
- Add string.find_multiple APIs to pylibcudf (#16920) @mroeschke
- Batch memcpy the last offsets for output buffers of str and list cols in PQ reader (#16905) @mhaseeb123
- reduce wheel build verbosity, narrow deprecation warning filter (#16896) @jameslamb
- Improve aggregation device functors (#16884) @PointKernel
- Upgrade pandas pinnings to support `2.2.3` (#16882) @galipremsagar
- Fix 24.10 to 24.12 forward merge (#16876) @bdice
- Manually resolve conflicts in between branch-24.12 and branch-24.10 (#16871) @galipremsagar
- Add in support for setting delim when parsing JSON through java (#16867) @revans2
- Reapply `mixed_semi_join` refactoring and bug fixes (#16859) @mhaseeb123
- Add string padding and side_type APIs to pylibcudf (#16833) @mroeschke
- Organize parquet reader mukernel non-nullable code, introduce manual block scans (#16830) @pmattione-nvidia
- Remove superfluous use of std::vector for std::future (#16829) @kingcrimsontianyu
- Rework `read_csv` IO to avoid reading whole input with a single `host_read` (#16826) @vuule
- Add strings.combine APIs to pylibcudf (#16790) @mroeschke
- Add remaining string.char_types APIs to pylibcudf (#16788) @mroeschke
- Avoid public constructors when called with columns to avoid unnecessary validation (#16747) @mroeschke
- Use `changed-files` shared workflow (#16713) @KyleFromNVIDIA
- lint: replace `isort` with Ruff's rule I (#16685) @Borda
- Parquet reader list microkernel (#16538) @pmattione-nvidia
- Refactor `histogram` reduction using `cuco::static_set::insert_and_find` (#16485) @srinivasyadav18
- Use numba-cuda>=0.0.13 (#16474) @gmarkall
- C++
Published by rapids-bot[bot] over 1 year ago
https://github.com/rapidsai/cudf - [NIGHTLY] v24.08.00
🔗 Links
🚨 Breaking Changes
- Align Index init APIs with pandas 2.x (#16362) @mroeschke
- Align Series APIs with pandas 2.x (#16333) @mroeschke
- Add missing `stream` param to dictionary factory APIs (#16319) @JayjeetAtGithub
- Deprecate dtype= parameter in reduction methods (#16313) @mroeschke
- Remove squeeze argument from groupby (#16312) @mroeschke
- Align more DataFrame APIs with pandas (#16310) @mroeschke
- Remove `mr` param from `write_csv` and `write_json` (#16231) @JayjeetAtGithub
- Report number of rows per file read by PQ reader when no row selection and fix segfault in chunked PQ reader when skip_rows > 0 (#16195) @mhaseeb123
- Refactor from_arrow_device/host to use resource_ref (#16160) @harrism
- Deprecate Arrow support in I/O (#16132) @lithomas1
- Return `FrozenList` for `Index.names` (#16047) @galipremsagar
- Add compile option to enable large strings support (#16037) @davidwendt
- Hide visibility of non public symbols (#15982) @robertmaynard
- Rename strings multiple target replace API (#15898) @davidwendt
- Pinned vector factory that uses the global pool (#15895) @vuule
- Apply clang-tidy autofixes (#15894) @vyasr
- Support `arrow:schema` in Parquet writer to faithfully roundtrip `duration` types with Arrow (#15875) @mhaseeb123
- Expose stream parameter to public rolling APIs (#15865) @srinivasyadav18
- Fix large strings handling in nvtext::character_tokenize (#15829) @davidwendt
- Remove legacy JSON reader and concurrent_unordered_map.cuh. (#15813) @bdice
🐛 Bug Fixes
- Ensure managed memory is supported in cudf.pandas. (#16552) @bdice
- Add `flatbuffers` to `libcudf` build (#16446) @galipremsagar
- Fix parquet_field_list read_func lambda capture invalid this pointer (#16440) @davidwendt
- Enable prefetching in cudf.pandas.install() (#16439) @bdice
- Enable prefetching before `runpy` (#16427) @galipremsagar
- Support thread-safe for `prefetch_config::get` and `prefetch_config::set` (#16425) @ttnghia
- Fix a `pandas-2.0` missing attribute error (#16416) @galipremsagar
- [Bug] Remove loud `NativeFile` deprecation noise for `read_parquet` from S3 (#16415) @rjzamora
- Fix nightly memcheck error for empty STREAM_INTEROP_TEST (#16406) @davidwendt
- Gate ArrowStringArrayNumpySemantics cudf.pandas proxy behind version check (#16401) @mroeschke
- Don't export bs_thread_pool (#16398) @KyleFromNVIDIA
- Require fixed width types for casting in `cudf-polars` (#16381) @brandon-b-miller
- Fix docstring of `DataFrame.apply` (#16351) @galipremsagar
- Make __bool__ raise for more cudf objects (#16311) @mroeschke
- Rename `.devcontainer`s for CUDA 12.5 (#16293) @jakirkham
- Fix split_record for all empty strings column (#16291) @davidwendt
- Fix logic in to_arrow for empty list column (#16279) @wence-
- [BUG] Make name attr of Index fast slow attrs (#16270) @Matt711
- Add custom name setter and getter for proxy objects in `cudf.pandas` (#16234) @Matt711
- Fall back when casting a timestamp to numeric in cudf-polars (#16232) @brandon-b-miller
- Disable large string support for Java build (#16216) @jlowe
- Remove CCCL patch for PR 211. (#16207) @bdice
- Add single offset to an empty ListArray in cudf::to_arrow (#16201) @davidwendt
- Fix `memory_usage` when calculating nested list column (#16193) @mroeschke
- Support at/iat indexers in cudf.pandas (#16177) @mroeschke
- Fix unused-return-value debug build error in from_arrow_stream_test.cpp (#16168) @davidwendt
- Fix cudf::strings::replace_multiple hang on empty target (#16167) @davidwendt
- Refactor from_arrow_device/host to use resource_ref (#16160) @harrism
- interpolate returns new column if no values are interpolated (#16158) @mroeschke
- Use provided memory resource for allocating mixed join results. (#16153) @bdice
- Run DFG after verify-alpha-spec (#16151) @KyleFromNVIDIA
- Use size_t to allow large conditional joins (#16127) @bdice
- Allow only scale=0 fixed-point values in fixed_width_column_wrapper (#16120) @davidwendt
- Fix pylibcudf Table.num_rows for 0 columns case and add interop to docs (#16108) @lithomas1
- Add support for proxy `np.flatiter` objects (#16107) @Matt711
- Ensure cudf objects can astype to any type when empty (#16106) @mroeschke
- Support `pd.read_pickle` and `pd.to_pickle` in `cudf.pandas` (#16105) @Matt711
- Fix unnecessarily strict check in parquet chunked reader for choosing split locations. (#16099) @nvdbaranec
- Fix `is_monotonic_*` APIs to include `nan's` (#16085) @galipremsagar
- More safely parse CUDA versions when subprocess output is contaminated (#16067) @brandon-b-miller
- fast_slow_proxy: Don't import assert_eq at top-level (#16063) @wence-
- Prevent bad ColumnAccessor state after .sort_index(axis=1, ignore_index=True) (#16061) @mroeschke
- Fix ArrowDeviceArray interface to pass address of event (#16058) @zeroshade
- Fix a size overflow bug in hash groupby (#16053) @PointKernel
- Fix `atomic_ref` scope when multiple blocks are updating the same output (#16051) @vuule
- Fix initialization error in to_arrow for empty string views (#16033) @wence-
- Fix the int32 overflow when computing page fragment sizes for large string columns (#16028) @mhaseeb123
- Fix the pool size alignment issue (#16024) @PointKernel
- Improve multibyte-split byte-range performance (#16019) @davidwendt
- Fix target counting in strings char-parallel replace (#16017) @davidwendt
- Support IntervalDtype in cudf.from_pandas (#16014) @mroeschke
- Fix memory size in create_byte_range_infos_consecutive (#16012) @davidwendt
- Hide visibility of non public symbols (#15982) @robertmaynard
- Fix Cython typo preventing proper inheritance (#15978) @vyasr
- Fix convert_dtypes with convert_integer=False/convert_floating=True (#15964) @mroeschke
- Fix nunique for `MultiIndex`, `DataFrame`, and all NA case with `dropna=False` (#15962) @mroeschke
- Explicitly build for all GPU architectures (#15959) @vyasr
- Preserve column type and class information in more DataFrame operations (#15949) @mroeschke
- Add __array_interface__ to cudf.pandas numpy.ndarray proxy (#15936) @mroeschke
- Allow tests to be built when stream util is disabled (#15933) @robertmaynard
- Fix JSON multi-source reading when total source size exceeds `INT_MAX` bytes (#15930) @shrshi
- Fix `dask_cudf.read_parquet` regression for legacy timestamp data (#15929) @rjzamora
- Fix offsetalator when accessing over 268 million rows (#15921) @davidwendt
- Fix debug assert in rowgroup_char_counts_kernel (#15902) @davidwendt
- Fix categorical conversion from chunked arrow arrays (#15886) @vyasr
- Handling for `NaN` and `inf` when converting floating point to fixed point types (#15885) @ttnghia
- Manual merge of Branch 24.08 from 24.06 (#15869) @galipremsagar
- Avoid unnecessary `Index` cast in `IndexedFrame.index` setter (#15843) @charlesbluca
- Fix large strings handling in nvtext::character_tokenize (#15829) @davidwendt
- Fix multi-replace target count logic for large strings (#15807) @davidwendt
- Fix JSON parsing memory corruption - Fix Mixed types nested children removal (#15798) @karthikeyann
- Allow anonymous user in devcontainer name. (#15784) @bdice
- Add support for additional metaclasses of proxies and use for ExcelWriter (#15399) @vyasr
📖 Documentation
- Improve Polars docs (#16820) @bdice
- Add docstring for from_dataframe (#16260) @mroeschke
- Update libcudf compiler requirements in contributing doc (#16103) @davidwendt
- Add libcudf public/detail API pattern to developer guide (#16086) @davidwendt
- Explain line profiler and how to know which functions are GPU-accelerated. (#16079) @bdice
- cudf.pandas documentation improvement (#15948) @Matt711
- Reland "Fix docs for IO readers and strings_convert" (#15872)" (#15941) @lithomas1
- Document how to use cudf.pandas in tandem with multiprocessing (#15940) @wence-
- DOC: Add documentation for cudf.pandas in the Developer Guide (#15889) @Matt711
- Improve options docs (#15888) @bdice
- DOC: add linkcode to docs (#15860) @raybellwaves
- DOC: use intersphinx mapping in pandas-compat ext (#15846) @raybellwaves
- Fix inconsistent usage of 'results' and 'records' in read-json.md (#15766) @dagardner-nv
- Update PandasCompat.py to resolve references (#15704) @raybellwaves
🚀 New Features
- Creation of CI artifacts for cudf-polars wheels (#16680) @wence-
- Warn on cuDF failure when `POLARS_VERBOSE` is true (#16308) @brandon-b-miller
- Add `drop_nulls` in `cudf-polars` (#16290) @brandon-b-miller
- [JNI] Add setKernelPinnedCopyThreshold and setPinnedAllocationThreshold (#16288) @abellina
- Implement support for scan_ndjson in cudf-polars (#16263) @lithomas1
- Publish cudf-polars nightlies (#16213) @lithomas1
- Modify `make_host_vector` and `make_device_uvector` factories to optionally use pinned memory and kernel copy (#16206) @vuule
- Migrate lists/set_operations to pylibcudf (#16190) @Matt711
- Migrate lists/filling to pylibcudf (#16189) @Matt711
- Fall back to CPU for unsupported libcudf binaryops in cudf-polars (#16188) @brandon-b-miller
- Use resource_ref for upstream in stream_checking_resource_adaptor (#16187) @harrism
- Migrate lists/modifying to pylibcudf (#16185) @Matt711
- Migrate lists/filtering to pylibcudf (#16184) @Matt711
- Migrate lists/sorting to pylibcudf (#16179) @Matt711
- Add missing methods to lists/list_column_view.pxd in pylibcudf (#16175) @Matt711
- Migrate pylibcudf lists gathering (#16170) @Matt711
- Move kernel vis over to CUDF_HIDDEN (#16165) @robertmaynard
- Add groupby_max multi-threaded benchmark (#16154) @srinivasyadav18
- Promote has_nested_columns to cudf public API (#16131) @robertmaynard
- Promote IO support queries to cudf API (#16125) @robertmaynard
- cudf::merge public API now support passing a user stream (#16124) @robertmaynard
- Add TPC-H inspired examples for Libcudf (#16088) @JayjeetAtGithub
- Installed cudf header use cudf::allocate_like (#16087) @robertmaynard
- `cudf-polars` string slicing (#16082) @brandon-b-miller
- Migrate Parquet reader to pylibcudf (#16078) @lithomas1
- Migrate lists/count_elements to pylibcudf (#16072) @Matt711
- Migrate lists/extract to pylibcudf (#16071) @Matt711
- Move common string utilities to public api (#16070) @robertmaynard
- stable_distinct public api now has a stream parameter (#16068) @robertmaynard
- Migrate expressions to pylibcudf (#16056) @lithomas1
- Add support to ArrowDataSource in SourceInfo (#16050) @lithomas1
- Experimental support for configurable prefetching (#16020) @vyasr
- Migrate CSV reader to pylibcudf (#16011) @lithomas1
- Migrate string `slice` APIs to `pylibcudf` (#15988) @brandon-b-miller
- Migrate lists/contains to pylibcudf (#15981) @Matt711
- Remove CCCL 2.2 patches as we now always use 2.5+ (#15969) @robertmaynard
- Migrate JSON reader to pylibcudf (#15966) @lithomas1
- Add a developer check for proxy objects (#15956) @Matt711
- Start migrating I/O writers to pylibcudf (starting with JSON) (#15952) @lithomas1
- Kernel copy for pinned memory (#15934) @vuule
- Migrate left join and conditional join benchmarks to use nvbench (#15931) @srinivasyadav18
- Migrate lists/combine to pylibcudf (#15928) @Matt711
- Plumb pylibcudf strings `contains_re` through cudf_polars (#15918) @brandon-b-miller
- Start migrating I/O to pylibcudf (#15899) @lithomas1
- Pinned vector factory that uses the global pool (#15895) @vuule
- Migrate strings `contains` operations to `pylibcudf` (#15880) @brandon-b-miller
- Migrate quantile.pxd to pylibcudf (#15874) @lithomas1
- Migrate round to pylibcudf (#15863) @lithomas1
- Migrate string replace.pxd to pylibcudf (#15839) @lithomas1
- Add an Environment Variable for debugging the fast path in cudf.pandas (#15837) @Matt711
- Add an option to run cuIO benchmarks with pinned buffers as input (#15830) @vuule
- Update `pylibcudf` testing utilities (#15772) @brandon-b-miller
- Migrate string `capitalize` APIs to `pylibcudf` (#15503) @brandon-b-miller
- Add tests for `pylibcudf` binaryops (#15470) @brandon-b-miller
- Migrate column factories to pylibcudf (#15257) @brandon-b-miller
- cuDF/libcudf exponentially weighted moving averages (#9027) @brandon-b-miller
🛠️ Improvements
- Ensure objects with interface are converted to cupy/numpy arrays (#16436) @mroeschke
- Add about rmm modes in `cudf.pandas` docs (#16404) @galipremsagar
- Gracefully CUDF_FAIL when `skip_rows > 0` in Chunked Parquet reader (#16385) @mhaseeb123
- Make C++ compilation warning free after #16297 (#16379) @wence-
- Align Index init APIs with pandas 2.x (#16362) @mroeschke
- Use rapids_cpm_bs_thread_pool() (#16360) @KyleFromNVIDIA
- Rename PrefetchConfig to prefetch_config. (#16358) @bdice
- Implement parquet reading using pylibcudf in cudf-polars (#16346) @lithomas1
- Fix compile warnings with `jni_utils.hpp` (#16336) @ttnghia
- Align Series APIs with pandas 2.x (#16333) @mroeschke
- Add missing `stream` param to dictionary factory APIs (#16319) @JayjeetAtGithub
- Mark cudf._typing as a typing module in ruff (#16318) @mroeschke
- Add `stream` param to list explode APIs (#16317) @JayjeetAtGithub
- Fix polars for 1.2.1 (#16316) @lithomas1
- Use workflow branch 24.08 again (#16314) @KyleFromNVIDIA
- Deprecate dtype= parameter in reduction methods (#16313) @mroeschke
- Remove squeeze argument from groupby (#16312) @mroeschke
- Align more DataFrame APIs with pandas (#16310) @mroeschke
- Clean unneeded/redundant dtype utils (#16309) @mroeschke
- Implement read_csv in cudf-polars using pylibcudf (#16307) @lithomas1
- Use Column.can_cast_safely instead of some ad-hoc dtype functions in .where (#16303) @mroeschke
- Drop `{{ pin_compatible('numpy', max_pin='x') }}` (#16301) @jakirkham
- Host implementation of `to_arrow` using nanoarrow (#16297) @zeroshade
- Add ability to prefetch in `cudf.pandas` and change default to managed pool (#16296) @galipremsagar
- Fix tests for polars 1.2 (#16292) @lithomas1
- Introduce dedicated options for low memory readers (#16289) @galipremsagar
- Remove decimal/floating 64/128bit switches due to register pressure (#16287) @pmattione-nvidia
- Make ColumnAccessor strictly require a mapping of columns (#16285) @mroeschke
- Introduce version file so we can conditionally handle things in tests (#16280) @wence-
- Type & reduce cupy usage (#16277) @mroeschke
- Update cudf::detail::grid_1d to use thread_index_type (#16276) @davidwendt
- Replace np.isscalar/issubdtype checks with is_scalar/.kind checks (#16275) @mroeschke
- Remove xml from sort_ninja_log.py utility (#16274) @davidwendt
- Fix issue in horizontal concat implementation in cudf-polars (#16271) @wence-
- Preserve order in left join for cudf-polars (#16268) @wence-
- Replace is_datetime/timedelta_dtype checks with .kind checks (#16262) @mroeschke
- Replace is_float/integer_dtype checks with .kind checks (#16261) @mroeschke
- Build and test with CUDA 12.5.1 (#16259) @KyleFromNVIDIA
- Replace is_bool_type with checking .dtype.kind (#16255) @mroeschke
- remove `cuco_noexcept.diff` (#16254) @trxcllnt
- Update contains_tests.cpp to use public cudf::slice (#16253) @davidwendt
- Improve the test data for pylibcudf I/O tests (#16247) @lithomas1
- Short circuit some Column methods (#16246) @mroeschke
- Make nvcomp adapter compatible with new version macros (#16245) @vuule
- Add Column.strftime/strptime instead of overloading `as_string/datetime/timedelta_column` (#16243) @mroeschke
- Remove temporary functor overloads required by cuco version bump (#16242) @PointKernel
- Remove hash_character_ngrams dependency from jaccard_index (#16241) @davidwendt
- Expose sorted groupby parameters to pylibcudf (#16240) @wence-
- Expose reflection to check if casting between two types is supported (#16239) @wence-
- Handle nans in groupby-aggregations in polars executor (#16233) @wence-
- Remove `mr` param from `write_csv` and `write_json` (#16231) @JayjeetAtGithub
- Support Literals in groupby-agg (#16218) @wence-
- Handler csv reader options in cudf-polars (#16211) @wence-
- Update vendored thread_pool implementation (#16210) @wence-
- Add low memory JSON reader for `cudf.pandas` (#16204) @galipremsagar
- Clean up state variables in MultiIndex (#16203) @mroeschke
- skip CMake 3.30.0 (#16202) @jameslamb
- Assert valid metadata is passed in to_arrow for list_view (#16198) @wence-
- Expose type traits to pylibcudf (#16197) @wence-
- Report number of rows per file read by PQ reader when no row selection and fix segfault in chunked PQ reader when skip_rows > 0 (#16195) @mhaseeb123
- Cast count aggs to correct dtype in translation (#16192) @wence-
- Some small fixes in cudf-polars (#16191) @wence-
- split up CUDA-suffixed dependencies in dependencies.yaml (#16183) @jameslamb
- Define PTDS for the stream hook libs (#16182) @trxcllnt
- Make `test_python_cudf_pandas` generate `requirements.txt` (#16181) @trxcllnt
- Add environment-agnostic `ci/run_cudf_polars_pytest.sh` (#16178) @trxcllnt
- Implement translation for some unary functions and a single datetime extraction (#16173) @wence-
- Remove size constraints on source files in batched JSON reading (#16162) @shrshi
- CI: Build wheels for cudf-polars (#16156) @lithomas1
- Update cudf-polars for v1 release of polars (#16149) @wence-
- Use strings concatenate to support large strings in CSV writer (#16148) @davidwendt
- Use verify-alpha-spec hook (#16144) @KyleFromNVIDIA
- Adds write-coalescing code path optimization to FST (#16143) @elstehle
- MAINT: Adapt to NumPy 2 promotion changes (#16141) @seberg
- API: Check for integer overflows when creating scalar from python int (#16140) @seberg
- Remove the (unused) implementation of `host_parse_nested_json` (#16135) @vuule
- Deprecate Arrow support in I/O (#16132) @lithomas1
- Disable dict support for split-page kernel in the parquet reader. (#16128) @nvdbaranec
- Add throughput metrics for REDUCTION_BENCH/REDUCTION_NVBENCH benchmarks (#16126) @jihoonson
- Add ensure_index to not unnecessarily shallow copy cudf.Index (#16117) @mroeschke
- Make binary operators work between fixed-point and floating args (#16116) @pmattione-nvidia
- Implement Ternary copy_if_else (#16114) @wence-
- Implement handlers for series literal in cudf-polars (#16113) @wence-
- Fix dtype errors in `StringArrays` (#16111) @galipremsagar
- Ensure MultiIndex.to_frame deep copies columns (#16110) @mroeschke
- Parallelize `gpuInitStringDescriptors` for fixed length byte array data (#16109) @mhaseeb123
- Finish implementation of cudf-polars boolean function handlers (#16098) @wence-
- Expose and then implement support for cross joins in cudf-polars (#16097) @wence-
- Defer copying in Column.astype(copy=True) (#16095) @mroeschke
- Fix segfault in conditional join (#16094) @bdice
- Free temp memory no longer needed in multibyte_split processing (#16091) @davidwendt
- Rename gather/scatter benchmarks to clarify coalesced behavior. (#16083) @bdice
- Adapt to polars upstream changes and turn on CI testing (#16081) @wence-
- Reduce/clean copy usage in Series, reshaping (#16080) @mroeschke
- Account for FIXED_LEN_BYTE_ARRAY when calculating fragment sizes in Parquet writer (#16064) @etseidl
- Reduce (shallow) copies in DataFrame ops (#16060) @mroeschke
- Add multi-file support to `dask_cudf.read_json` (#16057) @rjzamora
- Reduce deep copies in Index ops (#16054) @mroeschke
- Implement chunked column wise concat in chunked parquet reader (#16052) @galipremsagar
- Add exception when trying to create large strings with cudf::test::strings_column_wrapper (#16049) @davidwendt
- Return `FrozenList` for `Index.names` (#16047) @galipremsagar
- Add ast cast test (#16045) @pmattione-nvidia
- Remove `override_dtypes` and `include_index` from `Frame._copy_type_metadata` (#16043) @mroeschke
- Add ruff rules to avoid importing from typing (#16040) @mroeschke
- Fix decimal -> float cast in ast code (#16038) @pmattione-nvidia
- Add compile option to enable large strings support (#16037) @davidwendt
- Reduce conditional_join nvbench configurations (#16036) @srinivasyadav18
- Project automation update: skip if not in project (#16035) @jarmak-nv
- Add stream parameter to cudf::io::text::multibyte_split (#16034) @davidwendt
- Delete unused code from stringfunction evaluator (#16032) @wence-
- Fix exclude regex in pre-commit clang-format hook (#16030) @wence-
- Refactor rmm usage in `cudf.pandas` (#16021) @galipremsagar
- Enable ruff TCH: typing imports under if TYPE_CHECKING (#16015) @mroeschke
- Restrict the allowed pandas timezone objects in cudf (#16013) @mroeschke
- orc multithreaded benchmark (#16009) @zpuller
- Add tests of expression-based sort and sort-by (#16008) @wence-
- Add tests of implemented StringFunctions (#16007) @wence-
- Add test that diagonal concat with mismatching schemas raises (#16006) @wence-
- Add coverage selecting len from a dataframe (number of rows) (#16005) @wence-
- Add basic tests of dataframe scan (#16003) @wence-
- Add coverage for both expression and dataframe filter (#16002) @wence-
- Remove deprecated ExtContext node (#16001) @wence-
- Fix typo bug in gather implementation (#16000) @wence-
- Extend coverage of groupby and rolling window nodes (#15999) @wence-
- Coverage of binops where one or both operands are a scalar (#15998) @wence-
- Add full coverage for whole-frame Agg expressions (#15997) @wence-
- Add tests covering magic methods of Expr objects (#15996) @wence-
- Add full coverage of utility functions (#15995) @wence-
- Test behaviour of containers (#15994) @wence-
- Fix implementation of any, all, and isbetween (#15993) @wence-
- Raise early on unhandled PythonScan node (#15992) @wence-
- Remove mapfunction nodes that don't exist/aren't supported (#15991) @wence-
- Add test coverage for slicing with "out of bounds" negative indices (#15990) @wence-
- Standardize and type `Series.dt` methods (#15987) @mroeschke
- Refactor distinct with hashset-based algorithms (#15984) @srinivasyadav18
- resolve dependency-file-generator warning, remove unnecessary rapids-build-backend configuration (#15980) @jameslamb
- Project automation bug fixes (#15971) @jarmak-nv
- Add typing to single_column_frame (#15965) @mroeschke
- Move some misc Frame methods to appropriate locations (#15963) @mroeschke
- Condense pylibcudf data fixtures (#15958) @lithomas1
- Refactor fillna logic to push specifics toward Frame subclasses and Column subclasses (#15957) @mroeschke
- Remove unused parsing utilities (#15955) @vuule
- Remove `Scalar` container type from polars interpreter (#15953) @wence-
- Support arbitrary CUDA versions in UDF code (#15950) @bdice
- Support large strings in cudf::io::text::multibyte_split (#15947) @davidwendt
- Add external issue label and project automation (#15945) @jarmak-nv
- Enable round-tripping of large strings in `cudf` (#15944) @galipremsagar
- Add more complete type annotations in polars interpreter (#15942) @wence-
- Update implementations to build with the latest cuco (#15938) @PointKernel
- Support timezone aware pandas inputs in cudf (#15935) @mroeschke
- Define Column.nan_as_null to return self (#15923) @mroeschke
- Make Frame._dtype an iterator instead of a dict (#15920) @mroeschke
- Port start of datetime.hpp to pylibcudf (#15916) @wence-
- Introduce `NamedColumn` concept in cudf-polars (#15914) @wence-
- Avoid redefining Frame._get_columns_by_label in subclasses (#15912) @mroeschke
- Templatization of fixed-width parquet decoding kernels. (#15911) @nvdbaranec
- New Decimal <--> Floating conversion (#15905) @pmattione-nvidia
- Use Arrow C Data Interface functions for Python interop (#15904) @vyasr
- Use offsetalator in cudf::io::json::detail::parse_string (#15900) @davidwendt
- Rename strings multiple target replace API (#15898) @davidwendt
- Apply clang-tidy autofixes (#15894) @vyasr
- Update Python labels and remove unnecessary ones (#15893) @vyasr
- Clean up pylibcudf test assertions (#15892) @lithomas1
- Use offsetalator in orc rowgroup_char_counts_kernel (#15891) @davidwendt
- Ensure literals have correct dtype (#15890) @wence-
- Add overflow check when converting large strings to lists columns (#15887) @davidwendt
- Use offsetalator in nvtext::tokenize_with_vocabulary (#15878) @davidwendt
- Update interleave lists column for large strings (#15877) @davidwendt
- Simple NumPy 2 fixes that are clearly no behavior change (#15876) @seberg
- Support `arrow:schema` in Parquet writer to faithfully roundtrip `duration` types with Arrow (#15875) @mhaseeb123
- Refactor join benchmarks to target public APIs with the default stream (#15873) @PointKernel
- Fix url-decode benchmark to use offsetalator (#15871) @davidwendt
- Use offsetalator in strings shift functor (#15870) @davidwendt
- Memory Profiling (#15866) @madsbk
- Expose stream parameter to public rolling APIs (#15865) @srinivasyadav18
- Make Frame.astype return Self instead of a ColumnAccessor (#15861) @mroeschke
- Use ColumnAccessor row and column length attributes more consistently (#15857) @mroeschke
- add unit test setup for cudf_kafka (#15853) @jameslamb
- Remove internal usage of core.index.as_index in favor of cudf.Index (#15851) @mroeschke
- Ensure cudf.Series(cudf.Series(...)) creates a reference to the same index (#15845) @mroeschke
- Remove benchmark-specific use of pinned-pooled memory in Parquet multithreaded benchmark. (#15838) @nvdbaranec
- Implement `on_bad_lines` in json reader (#15834) @galipremsagar
- Make Column.to_pandas return Index instead of Series (#15833) @mroeschke
- Add test of interoperability of cuDF and arrow BYTE_STREAM_SPLIT encoders (#15832) @etseidl
- Refactor Parquet writer options and builders (#15831) @etseidl
- Migrate reshape.pxd to pylibcudf (#15827) @lithomas1
- Remove legacy JSON reader and concurrent_unordered_map.cuh. (#15813) @bdice
- Switch cuIO benchmarks to use pinned-pool host allocations by default. (#15805) @nvdbaranec
- Change thrust::count_if call to raw kernel in strings split APIs (#15762) @davidwendt
- Improve performance for long strings for nvtext::replace_tokens (#15756) @davidwendt
- Implement chunked parquet reader in cudf-python (#15728) @galipremsagar
- Add `from_arrow_host` functions for cudf interop with nanoarrow (#15645) @zeroshade
- Add ability to enable rmm pool on `cudf.pandas` import (#15628) @galipremsagar
- Executor for polars logical plans (#15504) @wence-
- Implement day_name and month_name to match pandas (#15479) @btepera
- Utilities for decimal <--> floating conversion (#15359) @pmattione-nvidia
- For powers of 10, replace ipow with switch (#15353) @pmattione-nvidia
- Use rapids-build-backend. (#15245) @vyasr
- Add `codecov` coverage for `pandas_tests` (#14513) @galipremsagar
Published by rapids-bot[bot] over 1 year ago
https://github.com/rapidsai/cudf - v24.08.03
🚨 Breaking Changes
- Align Index init APIs with pandas 2.x (#16362) @mroeschke
- Align Series APIs with pandas 2.x (#16333) @mroeschke
- Add missing `stream` param to dictionary factory APIs (#16319) @JayjeetAtGithub
- Deprecate dtype= parameter in reduction methods (#16313) @mroeschke
- Remove squeeze argument from groupby (#16312) @mroeschke
- Align more DataFrame APIs with pandas (#16310) @mroeschke
- Remove `mr` param from `write_csv` and `write_json` (#16231) @JayjeetAtGithub
- Report number of rows per file read by PQ reader when no row selection and fix segfault in chunked PQ reader when skip_rows > 0 (#16195) @mhaseeb123
- Refactor from_arrow_device/host to use resource_ref (#16160) @harrism
- Deprecate Arrow support in I/O (#16132) @lithomas1
- Return `FrozenList` for `Index.names` (#16047) @galipremsagar
- Add compile option to enable large strings support (#16037) @davidwendt
- Hide visibility of non public symbols (#15982) @robertmaynard
- Rename strings multiple target replace API (#15898) @davidwendt
- Pinned vector factory that uses the global pool (#15895) @vuule
- Apply clang-tidy autofixes (#15894) @vyasr
- Support `arrow:schema` in Parquet writer to faithfully roundtrip `duration` types with Arrow (#15875) @mhaseeb123
- Expose stream parameter to public rolling APIs (#15865) @srinivasyadav18
- Fix large strings handling in nvtext::character_tokenize (#15829) @davidwendt
- Remove legacy JSON reader and concurrent_unordered_map.cuh. (#15813) @bdice
🐛 Bug Fixes
- Ensure managed memory is supported in cudf.pandas. (#16552) @bdice
- Add `flatbuffers` to `libcudf` build (#16446) @galipremsagar
- Fix parquet_field_list read_func lambda capture invalid this pointer (#16440) @davidwendt
- Enable prefetching in cudf.pandas.install() (#16439) @bdice
- Enable prefetching before `runpy` (#16427) @galipremsagar
- Support thread-safe for `prefetch_config::get` and `prefetch_config::set` (#16425) @ttnghia
- Fix a `pandas-2.0` missing attribute error (#16416) @galipremsagar
- [Bug] Remove loud `NativeFile` deprecation noise for `read_parquet` from S3 (#16415) @rjzamora
- Fix nightly memcheck error for empty STREAM_INTEROP_TEST (#16406) @davidwendt
- Gate ArrowStringArrayNumpySemantics cudf.pandas proxy behind version check (#16401) @mroeschke
- Don't export bs_thread_pool (#16398) @KyleFromNVIDIA
- Require fixed width types for casting in `cudf-polars` (#16381) @brandon-b-miller
- Fix docstring of `DataFrame.apply` (#16351) @galipremsagar
- Make __bool__ raise for more cudf objects (#16311) @mroeschke
- Rename `.devcontainer`s for CUDA 12.5 (#16293) @jakirkham
- Fix split_record for all empty strings column (#16291) @davidwendt
- Fix logic in to_arrow for empty list column (#16279) @wence-
- [BUG] Make name attr of Index fast slow attrs (#16270) @Matt711
- Add custom name setter and getter for proxy objects in `cudf.pandas` (#16234) @Matt711
- Fall back when casting a timestamp to numeric in cudf-polars (#16232) @brandon-b-miller
- Disable large string support for Java build (#16216) @jlowe
- Remove CCCL patch for PR 211. (#16207) @bdice
- Add single offset to an empty ListArray in cudf::to_arrow (#16201) @davidwendt
- Fix `memory_usage` when calculating nested list column (#16193) @mroeschke
- Support at/iat indexers in cudf.pandas (#16177) @mroeschke
- Fix unused-return-value debug build error in from_arrow_stream_test.cpp (#16168) @davidwendt
- Fix cudf::strings::replace_multiple hang on empty target (#16167) @davidwendt
- Refactor from_arrow_device/host to use resource_ref (#16160) @harrism
- interpolate returns new column if no values are interpolated (#16158) @mroeschke
- Use provided memory resource for allocating mixed join results. (#16153) @bdice
- Run DFG after verify-alpha-spec (#16151) @KyleFromNVIDIA
- Use size_t to allow large conditional joins (#16127) @bdice
- Allow only scale=0 fixed-point values in fixed_width_column_wrapper (#16120) @davidwendt
- Fix pylibcudf Table.num_rows for 0 columns case and add interop to docs (#16108) @lithomas1
- Add support for proxy `np.flatiter` objects (#16107) @Matt711
- Ensure cudf objects can astype to any type when empty (#16106) @mroeschke
- Support `pd.read_pickle` and `pd.to_pickle` in `cudf.pandas` (#16105) @Matt711
- Fix unnecessarily strict check in parquet chunked reader for choosing split locations. (#16099) @nvdbaranec
- Fix `is_monotonic_*` APIs to include `nan's` (#16085) @galipremsagar
- More safely parse CUDA versions when subprocess output is contaminated (#16067) @brandon-b-miller
- fast_slow_proxy: Don't import assert_eq at top-level (#16063) @wence-
- Prevent bad ColumnAccessor state after .sort_index(axis=1, ignore_index=True) (#16061) @mroeschke
- Fix ArrowDeviceArray interface to pass address of event (#16058) @zeroshade
- Fix a size overflow bug in hash groupby (#16053) @PointKernel
- Fix `atomic_ref` scope when multiple blocks are updating the same output (#16051) @vuule
- Fix initialization error in to_arrow for empty string views (#16033) @wence-
- Fix the int32 overflow when computing page fragment sizes for large string columns (#16028) @mhaseeb123
- Fix the pool size alignment issue (#16024) @PointKernel
- Improve multibyte-split byte-range performance (#16019) @davidwendt
- Fix target counting in strings char-parallel replace (#16017) @davidwendt
- Support IntervalDtype in cudf.from_pandas (#16014) @mroeschke
- Fix memory size in create_byte_range_infos_consecutive (#16012) @davidwendt
- Hide visibility of non public symbols (#15982) @robertmaynard
- Fix Cython typo preventing proper inheritance (#15978) @vyasr
- Fix convert_dtypes with convert_integer=False/convert_floating=True (#15964) @mroeschke
- Fix nunique for `MultiIndex`, `DataFrame`, and all NA case with `dropna=False` (#15962) @mroeschke
- Explicitly build for all GPU architectures (#15959) @vyasr
- Preserve column type and class information in more DataFrame operations (#15949) @mroeschke
- Add __array_interface__ to cudf.pandas numpy.ndarray proxy (#15936) @mroeschke
- Allow tests to be built when stream util is disabled (#15933) @robertmaynard
- Fix JSON multi-source reading when total source size exceeds `INT_MAX` bytes (#15930) @shrshi
- Fix `dask_cudf.read_parquet` regression for legacy timestamp data (#15929) @rjzamora
- Fix offsetalator when accessing over 268 million rows (#15921) @davidwendt
- Fix debug assert in rowgroup_char_counts_kernel (#15902) @davidwendt
- Fix categorical conversion from chunked arrow arrays (#15886) @vyasr
- Handling for `NaN` and `inf` when converting floating point to fixed point types (#15885) @ttnghia
- Manual merge of Branch 24.08 from 24.06 (#15869) @galipremsagar
- Avoid unnecessary `Index` cast in `IndexedFrame.index` setter (#15843) @charlesbluca
- Fix large strings handling in nvtext::character_tokenize (#15829) @davidwendt
- Fix multi-replace target count logic for large strings (#15807) @davidwendt
- Fix JSON parsing memory corruption - Fix Mixed types nested children removal (#15798) @karthikeyann
- Allow anonymous user in devcontainer name. (#15784) @bdice
- Add support for additional metaclasses of proxies and use for ExcelWriter (#15399) @vyasr
📖 Documentation
- Add docstring for from_dataframe (#16260) @mroeschke
- Update libcudf compiler requirements in contributing doc (#16103) @davidwendt
- Add libcudf public/detail API pattern to developer guide (#16086) @davidwendt
- Explain line profiler and how to know which functions are GPU-accelerated. (#16079) @bdice
- cudf.pandas documentation improvement (#15948) @Matt711
- Reland "Fix docs for IO readers and strings_convert (#15872)" (#15941) @lithomas1
- Document how to use cudf.pandas in tandem with multiprocessing (#15940) @wence-
- DOC: Add documentation for cudf.pandas in the Developer Guide (#15889) @Matt711
- Improve options docs (#15888) @bdice
- DOC: add linkcode to docs (#15860) @raybellwaves
- DOC: use intersphinx mapping in pandas-compat ext (#15846) @raybellwaves
- Fix inconsistent usage of 'results' and 'records' in read-json.md (#15766) @dagardner-nv
- Update PandasCompat.py to resolve references (#15704) @raybellwaves
🚀 New Features
- Creation of CI artifacts for cudf-polars wheels (#16680) @wence-
- Warn on cuDF failure when `POLARS_VERBOSE` is true (#16308) @brandon-b-miller
- Add `drop_nulls` in `cudf-polars` (#16290) @brandon-b-miller
- [JNI] Add setKernelPinnedCopyThreshold and setPinnedAllocationThreshold (#16288) @abellina
- Implement support for scan_ndjson in cudf-polars (#16263) @lithomas1
- Publish cudf-polars nightlies (#16213) @lithomas1
- Modify `make_host_vector` and `make_device_uvector` factories to optionally use pinned memory and kernel copy (#16206) @vuule
- Migrate lists/set_operations to pylibcudf (#16190) @Matt711
- Migrate lists/filling to pylibcudf (#16189) @Matt711
- Fall back to CPU for unsupported libcudf binaryops in cudf-polars (#16188) @brandon-b-miller
- Use resource_ref for upstream in stream_checking_resource_adaptor (#16187) @harrism
- Migrate lists/modifying to pylibcudf (#16185) @Matt711
- Migrate lists/filtering to pylibcudf (#16184) @Matt711
- Migrate lists/sorting to pylibcudf (#16179) @Matt711
- Add missing methods to lists/list_column_view.pxd in pylibcudf (#16175) @Matt711
- Migrate pylibcudf lists gathering (#16170) @Matt711
- Move kernel vis over to CUDF_HIDDEN (#16165) @robertmaynard
- Add groupby_max multi-threaded benchmark (#16154) @srinivasyadav18
- Promote has_nested_columns to cudf public API (#16131) @robertmaynard
- Promote IO support queries to cudf API (#16125) @robertmaynard
- cudf::merge public API now support passing a user stream (#16124) @robertmaynard
- Add TPC-H inspired examples for Libcudf (#16088) @JayjeetAtGithub
- Installed cudf header use cudf::allocate_like (#16087) @robertmaynard
- `cudf-polars` string slicing (#16082) @brandon-b-miller
- Migrate Parquet reader to pylibcudf (#16078) @lithomas1
- Migrate lists/count_elements to pylibcudf (#16072) @Matt711
- Migrate lists/extract to pylibcudf (#16071) @Matt711
- Move common string utilities to public api (#16070) @robertmaynard
- stable_distinct public api now has a stream parameter (#16068) @robertmaynard
- Migrate expressions to pylibcudf (#16056) @lithomas1
- Add support to ArrowDataSource in SourceInfo (#16050) @lithomas1
- Experimental support for configurable prefetching (#16020) @vyasr
- Migrate CSV reader to pylibcudf (#16011) @lithomas1
- Migrate string `slice` APIs to `pylibcudf` (#15988) @brandon-b-miller
- Migrate lists/contains to pylibcudf (#15981) @Matt711
- Remove CCCL 2.2 patches as we now always use 2.5+ (#15969) @robertmaynard
- Migrate JSON reader to pylibcudf (#15966) @lithomas1
- Add a developer check for proxy objects (#15956) @Matt711
- Start migrating I/O writers to pylibcudf (starting with JSON) (#15952) @lithomas1
- Kernel copy for pinned memory (#15934) @vuule
- Migrate left join and conditional join benchmarks to use nvbench (#15931) @srinivasyadav18
- Migrate lists/combine to pylibcudf (#15928) @Matt711
- Plumb pylibcudf strings `contains_re` through cudf_polars (#15918) @brandon-b-miller
- Start migrating I/O to pylibcudf (#15899) @lithomas1
- Pinned vector factory that uses the global pool (#15895) @vuule
- Migrate strings `contains` operations to `pylibcudf` (#15880) @brandon-b-miller
- Migrate quantile.pxd to pylibcudf (#15874) @lithomas1
- Migrate round to pylibcudf (#15863) @lithomas1
- Migrate string replace.pxd to pylibcudf (#15839) @lithomas1
- Add an Environment Variable for debugging the fast path in cudf.pandas (#15837) @Matt711
- Add an option to run cuIO benchmarks with pinned buffers as input (#15830) @vuule
- Update `pylibcudf` testing utilities (#15772) @brandon-b-miller
- Migrate string `capitalize` APIs to `pylibcudf` (#15503) @brandon-b-miller
- Add tests for `pylibcudf` binaryops (#15470) @brandon-b-miller
- Migrate column factories to pylibcudf (#15257) @brandon-b-miller
- cuDF/libcudf exponentially weighted moving averages (#9027) @brandon-b-miller
🛠️ Improvements
- Ensure objects with interface are converted to cupy/numpy arrays (#16436) @mroeschke
- Add about rmm modes in `cudf.pandas` docs (#16404) @galipremsagar
- Gracefully CUDF_FAIL when `skip_rows > 0` in Chunked Parquet reader (#16385) @mhaseeb123
- Make C++ compilation warning free after #16297 (#16379) @wence-
- Align Index init APIs with pandas 2.x (#16362) @mroeschke
- Use rapids_cpm_bs_thread_pool() (#16360) @KyleFromNVIDIA
- Rename PrefetchConfig to prefetch_config. (#16358) @bdice
- Implement parquet reading using pylibcudf in cudf-polars (#16346) @lithomas1
- Fix compile warnings with `jni_utils.hpp` (#16336) @ttnghia
- Align Series APIs with pandas 2.x (#16333) @mroeschke
- Add missing `stream` param to dictionary factory APIs (#16319) @JayjeetAtGithub
- Mark cudf._typing as a typing module in ruff (#16318) @mroeschke
- Add `stream` param to list explode APIs (#16317) @JayjeetAtGithub
- Fix polars for 1.2.1 (#16316) @lithomas1
- Use workflow branch 24.08 again (#16314) @KyleFromNVIDIA
- Deprecate dtype= parameter in reduction methods (#16313) @mroeschke
- Remove squeeze argument from groupby (#16312) @mroeschke
- Align more DataFrame APIs with pandas (#16310) @mroeschke
- Clean unneeded/redundant dtype utils (#16309) @mroeschke
- Implement read_csv in cudf-polars using pylibcudf (#16307) @lithomas1
- Use Column.can_cast_safely instead of some ad-hoc dtype functions in .where (#16303) @mroeschke
- Drop `{{ pin_compatible('numpy', max_pin='x') }}` (#16301) @jakirkham
- Host implementation of `to_arrow` using nanoarrow (#16297) @zeroshade
- Add ability to prefetch in `cudf.pandas` and change default to managed pool (#16296) @galipremsagar
- Fix tests for polars 1.2 (#16292) @lithomas1
- Introduce dedicated options for low memory readers (#16289) @galipremsagar
- Remove decimal/floating 64/128bit switches due to register pressure (#16287) @pmattione-nvidia
- Make ColumnAccessor strictly require a mapping of columns (#16285) @mroeschke
- Introduce version file so we can conditionally handle things in tests (#16280) @wence-
- Type & reduce cupy usage (#16277) @mroeschke
- Update cudf::detail::grid_1d to use thread_index_type (#16276) @davidwendt
- Replace np.isscalar/issubdtype checks with is_scalar/.kind checks (#16275) @mroeschke
- Remove xml from sort_ninja_log.py utility (#16274) @davidwendt
- Fix issue in horizontal concat implementation in cudf-polars (#16271) @wence-
- Preserve order in left join for cudf-polars (#16268) @wence-
- Replace is_datetime/timedelta_dtype checks with .kind checks (#16262) @mroeschke
- Replace is_float/integer_dtype checks with .kind checks (#16261) @mroeschke
- Build and test with CUDA 12.5.1 (#16259) @KyleFromNVIDIA
- Replace is_bool_type with checking .dtype.kind (#16255) @mroeschke
- remove `cuco_noexcept.diff` (#16254) @trxcllnt
- Update contains_tests.cpp to use public cudf::slice (#16253) @davidwendt
- Improve the test data for pylibcudf I/O tests (#16247) @lithomas1
- Short circuit some Column methods (#16246) @mroeschke
- Make nvcomp adapter compatible with new version macros (#16245) @vuule
- Add Column.strftime/strptime instead of overloading `as_string/datetime/timedelta_column` (#16243) @mroeschke
- Remove temporary functor overloads required by cuco version bump (#16242) @PointKernel
- Remove hash_character_ngrams dependency from jaccard_index (#16241) @davidwendt
- Expose sorted groupby parameters to pylibcudf (#16240) @wence-
- Expose reflection to check if casting between two types is supported (#16239) @wence-
- Handle nans in groupby-aggregations in polars executor (#16233) @wence-
- Remove `mr` param from `write_csv` and `write_json` (#16231) @JayjeetAtGithub
- Support Literals in groupby-agg (#16218) @wence-
- Handler csv reader options in cudf-polars (#16211) @wence-
- Update vendored thread_pool implementation (#16210) @wence-
- Add low memory JSON reader for `cudf.pandas` (#16204) @galipremsagar
- Clean up state variables in MultiIndex (#16203) @mroeschke
- skip CMake 3.30.0 (#16202) @jameslamb
- Assert valid metadata is passed in to_arrow for list_view (#16198) @wence-
- Expose type traits to pylibcudf (#16197) @wence-
- Report number of rows per file read by PQ reader when no row selection and fix segfault in chunked PQ reader when skip_rows > 0 (#16195) @mhaseeb123
- Cast count aggs to correct dtype in translation (#16192) @wence-
- Some small fixes in cudf-polars (#16191) @wence-
- split up CUDA-suffixed dependencies in dependencies.yaml (#16183) @jameslamb
- Define PTDS for the stream hook libs (#16182) @trxcllnt
- Make `test_python_cudf_pandas` generate `requirements.txt` (#16181) @trxcllnt
- Add environment-agnostic `ci/run_cudf_polars_pytest.sh` (#16178) @trxcllnt
- Implement translation for some unary functions and a single datetime extraction (#16173) @wence-
- Remove size constraints on source files in batched JSON reading (#16162) @shrshi
- CI: Build wheels for cudf-polars (#16156) @lithomas1
- Update cudf-polars for v1 release of polars (#16149) @wence-
- Use strings concatenate to support large strings in CSV writer (#16148) @davidwendt
- Use verify-alpha-spec hook (#16144) @KyleFromNVIDIA
- Adds write-coalescing code path optimization to FST (#16143) @elstehle
- MAINT: Adapt to NumPy 2 promotion changes (#16141) @seberg
- API: Check for integer overflows when creating scalar from python int (#16140) @seberg
- Remove the (unused) implementation of `host_parse_nested_json` (#16135) @vuule
- Deprecate Arrow support in I/O (#16132) @lithomas1
- Disable dict support for split-page kernel in the parquet reader. (#16128) @nvdbaranec
- Add throughput metrics for REDUCTION_BENCH/REDUCTION_NVBENCH benchmarks (#16126) @jihoonson
- Add ensure_index to not unnecessarily shallow copy cudf.Index (#16117) @mroeschke
- Make binary operators work between fixed-point and floating args (#16116) @pmattione-nvidia
- Implement Ternary copy_if_else (#16114) @wence-
- Implement handlers for series literal in cudf-polars (#16113) @wence-
- Fix dtype errors in `StringArrays` (#16111) @galipremsagar
- Ensure MultiIndex.to_frame deep copies columns (#16110) @mroeschke
- Parallelize `gpuInitStringDescriptors` for fixed length byte array data (#16109) @mhaseeb123
- Finish implementation of cudf-polars boolean function handlers (#16098) @wence-
- Expose and then implement support for cross joins in cudf-polars (#16097) @wence-
- Defer copying in Column.astype(copy=True) (#16095) @mroeschke
- Fix segfault in conditional join (#16094) @bdice
- Free temp memory no longer needed in multibyte_split processing (#16091) @davidwendt
- Rename gather/scatter benchmarks to clarify coalesced behavior. (#16083) @bdice
- Adapt to polars upstream changes and turn on CI testing (#16081) @wence-
- Reduce/clean copy usage in Series, reshaping (#16080) @mroeschke
- Account for FIXED_LEN_BYTE_ARRAY when calculating fragment sizes in Parquet writer (#16064) @etseidl
- Reduce (shallow) copies in DataFrame ops (#16060) @mroeschke
- Add multi-file support to `dask_cudf.read_json` (#16057) @rjzamora
- Reduce deep copies in Index ops (#16054) @mroeschke
- Implement chunked column wise concat in chunked parquet reader (#16052) @galipremsagar
- Add exception when trying to create large strings with cudf::test::strings_column_wrapper (#16049) @davidwendt
- Return `FrozenList` for `Index.names` (#16047) @galipremsagar
- Add ast cast test (#16045) @pmattione-nvidia
- Remove `override_dtypes` and `include_index` from `Frame._copy_type_metadata` (#16043) @mroeschke
- Add ruff rules to avoid importing from typing (#16040) @mroeschke
- Fix decimal -> float cast in ast code (#16038) @pmattione-nvidia
- Add compile option to enable large strings support (#16037) @davidwendt
- Reduce conditional_join nvbench configurations (#16036) @srinivasyadav18
- Project automation update: skip if not in project (#16035) @jarmak-nv
- Add stream parameter to cudf::io::text::multibyte_split (#16034) @davidwendt
- Delete unused code from stringfunction evaluator (#16032) @wence-
- Fix exclude regex in pre-commit clang-format hook (#16030) @wence-
- Refactor rmm usage in `cudf.pandas` (#16021) @galipremsagar
- Enable ruff TCH: typing imports under if TYPE_CHECKING (#16015) @mroeschke
- Restrict the allowed pandas timezone objects in cudf (#16013) @mroeschke
- orc multithreaded benchmark (#16009) @zpuller
- Add tests of expression-based sort and sort-by (#16008) @wence-
- Add tests of implemented StringFunctions (#16007) @wence-
- Add test that diagonal concat with mismatching schemas raises (#16006) @wence-
- Add coverage selecting len from a dataframe (number of rows) (#16005) @wence-
- Add basic tests of dataframe scan (#16003) @wence-
- Add coverage for both expression and dataframe filter (#16002) @wence-
- Remove deprecated ExtContext node (#16001) @wence-
- Fix typo bug in gather implementation (#16000) @wence-
- Extend coverage of groupby and rolling window nodes (#15999) @wence-
- Coverage of binops where one or both operands are a scalar (#15998) @wence-
- Add full coverage for whole-frame Agg expressions (#15997) @wence-
- Add tests covering magic methods of Expr objects (#15996) @wence-
- Add full coverage of utility functions (#15995) @wence-
- Test behaviour of containers (#15994) @wence-
- Fix implementation of any, all, and isbetween (#15993) @wence-
- Raise early on unhandled PythonScan node (#15992) @wence-
- Remove mapfunction nodes that don't exist/aren't supported (#15991) @wence-
- Add test coverage for slicing with "out of bounds" negative indices (#15990) @wence-
- Standardize and type `Series.dt` methods (#15987) @mroeschke
- Refactor distinct with hashset-based algorithms (#15984) @srinivasyadav18
- resolve dependency-file-generator warning, remove unnecessary rapids-build-backend configuration (#15980) @jameslamb
- Project automation bug fixes (#15971) @jarmak-nv
- Add typing to single_column_frame (#15965) @mroeschke
- Move some misc Frame methods to appropriate locations (#15963) @mroeschke
- Condense pylibcudf data fixtures (#15958) @lithomas1
- Refactor fillna logic to push specifics toward Frame subclasses and Column subclasses (#15957) @mroeschke
- Remove unused parsing utilities (#15955) @vuule
- Remove `Scalar` container type from polars interpreter (#15953) @wence-
- Support arbitrary CUDA versions in UDF code (#15950) @bdice
- Support large strings in cudf::io::text::multibyte_split (#15947) @davidwendt
- Add external issue label and project automation (#15945) @jarmak-nv
- Enable round-tripping of large strings in `cudf` (#15944) @galipremsagar
- Add more complete type annotations in polars interpreter (#15942) @wence-
- Update implementations to build with the latest cuco (#15938) @PointKernel
- Support timezone aware pandas inputs in cudf (#15935) @mroeschke
- Define Column.nan_as_null to return self (#15923) @mroeschke
- Make Frame._dtype an iterator instead of a dict (#15920) @mroeschke
- Port start of datetime.hpp to pylibcudf (#15916) @wence-
- Introduce `NamedColumn` concept in cudf-polars (#15914) @wence-
- Avoid redefining Frame._get_columns_by_label in subclasses (#15912) @mroeschke
- Templatization of fixed-width parquet decoding kernels. (#15911) @nvdbaranec
- New Decimal <--> Floating conversion (#15905) @pmattione-nvidia
- Use Arrow C Data Interface functions for Python interop (#15904) @vyasr
- Use offsetalator in cudf::io::json::detail::parse_string (#15900) @davidwendt
- Rename strings multiple target replace API (#15898) @davidwendt
- Apply clang-tidy autofixes (#15894) @vyasr
- Update Python labels and remove unnecessary ones (#15893) @vyasr
- Clean up pylibcudf test assertions (#15892) @lithomas1
- Use offsetalator in orc rowgroup_char_counts_kernel (#15891) @davidwendt
- Ensure literals have correct dtype (#15890) @wence-
- Add overflow check when converting large strings to lists columns (#15887) @davidwendt
- Use offsetalator in nvtext::tokenize_with_vocabulary (#15878) @davidwendt
- Update interleave lists column for large strings (#15877) @davidwendt
- Simple NumPy 2 fixes that are clearly no behavior change (#15876) @seberg
- Support `arrow:schema` in Parquet writer to faithfully roundtrip `duration` types with Arrow (#15875) @mhaseeb123
- Refactor join benchmarks to target public APIs with the default stream (#15873) @PointKernel
- Fix url-decode benchmark to use offsetalator (#15871) @davidwendt
- Use offsetalator in strings shift functor (#15870) @davidwendt
- Memory Profiling (#15866) @madsbk
- Expose stream parameter to public rolling APIs (#15865) @srinivasyadav18
- Make Frame.astype return Self instead of a ColumnAccessor (#15861) @mroeschke
- Use ColumnAccessor row and column length attributes more consistently (#15857) @mroeschke
- add unit test setup for cudf_kafka (#15853) @jameslamb
- Remove internal usage of core.index.as_index in favor of cudf.Index (#15851) @mroeschke
- Ensure cudf.Series(cudf.Series(...)) creates a reference to the same index (#15845) @mroeschke
- Remove benchmark-specific use of pinned-pooled memory in Parquet multithreaded benchmark. (#15838) @nvdbaranec
- Implement `on_bad_lines` in json reader (#15834) @galipremsagar
- Make Column.to_pandas return Index instead of Series (#15833) @mroeschke
- Add test of interoperability of cuDF and arrow BYTE_STREAM_SPLIT encoders (#15832) @etseidl
- Refactor Parquet writer options and builders (#15831) @etseidl
- Migrate reshape.pxd to pylibcudf (#15827) @lithomas1
- Remove legacy JSON reader and concurrent_unordered_map.cuh. (#15813) @bdice
- Switch cuIO benchmarks to use pinned-pool host allocations by default. (#15805) @nvdbaranec
- Change thrust::count_if call to raw kernel in strings split APIs (#15762) @davidwendt
- Improve performance for long strings for nvtext::replace_tokens (#15756) @davidwendt
- Implement chunked parquet reader in cudf-python (#15728) @galipremsagar
- Add `from_arrow_host` functions for cudf interop with nanoarrow (#15645) @zeroshade
- Add ability to enable rmm pool on `cudf.pandas` import (#15628) @galipremsagar
- Executor for polars logical plans (#15504) @wence-
- Implement dayname and monthname to match pandas (#15479) @btepera
- Utilities for decimal <--> floating conversion (#15359) @pmattione-nvidia
- For powers of 10, replace ipow with switch (#15353) @pmattione-nvidia
- Use rapids-build-backend. (#15245) @vyasr
- Add `codecov` coverage for `pandas_tests` (#14513) @galipremsagar
Published by raydouglass over 1 year ago
https://github.com/rapidsai/cudf - v24.08.02
🚨 Breaking Changes
- Align Index init APIs with pandas 2.x (#16362) @mroeschke
- Align Series APIs with pandas 2.x (#16333) @mroeschke
- Add missing `stream` param to dictionary factory APIs (#16319) @JayjeetAtGithub
- Deprecate dtype= parameter in reduction methods (#16313) @mroeschke
- Remove squeeze argument from groupby (#16312) @mroeschke
- Align more DataFrame APIs with pandas (#16310) @mroeschke
- Remove `mr` param from `write_csv` and `write_json` (#16231) @JayjeetAtGithub
- Report number of rows per file read by PQ reader when no row selection and fix segfault in chunked PQ reader when skip_rows > 0 (#16195) @mhaseeb123
- Refactor from_arrow_device/host to use resource_ref (#16160) @harrism
- Deprecate Arrow support in I/O (#16132) @lithomas1
- Return `FrozenList` for `Index.names` (#16047) @galipremsagar
- Add compile option to enable large strings support (#16037) @davidwendt
- Hide visibility of non public symbols (#15982) @robertmaynard
- Rename strings multiple target replace API (#15898) @davidwendt
- Pinned vector factory that uses the global pool (#15895) @vuule
- Apply clang-tidy autofixes (#15894) @vyasr
- Support `arrow:schema` in Parquet writer to faithfully roundtrip `duration` types with Arrow (#15875) @mhaseeb123
- Expose stream parameter to public rolling APIs (#15865) @srinivasyadav18
- Fix large strings handling in nvtext::character_tokenize (#15829) @davidwendt
- Remove legacy JSON reader and concurrent_unordered_map.cuh. (#15813) @bdice
🐛 Bug Fixes
- Ensure managed memory is supported in cudf.pandas. (#16552) @bdice
- Add `flatbuffers` to `libcudf` build (#16446) @galipremsagar
- Fix parquet_field_list read_func lambda capture invalid this pointer (#16440) @davidwendt
- Enable prefetching in cudf.pandas.install() (#16439) @bdice
- Enable prefetching before `runpy` (#16427) @galipremsagar
- Support thread-safe for `prefetch_config::get` and `prefetch_config::set` (#16425) @ttnghia
- Fix a `pandas-2.0` missing attribute error (#16416) @galipremsagar
- [Bug] Remove loud `NativeFile` deprecation noise for `read_parquet` from S3 (#16415) @rjzamora
- Fix nightly memcheck error for empty STREAM_INTEROP_TEST (#16406) @davidwendt
- Gate ArrowStringArrayNumpySemantics cudf.pandas proxy behind version check (#16401) @mroeschke
- Don't export bs_thread_pool (#16398) @KyleFromNVIDIA
- Require fixed width types for casting in `cudf-polars` (#16381) @brandon-b-miller
- Fix docstring of `DataFrame.apply` (#16351) @galipremsagar
- Make `__bool__` raise for more cudf objects (#16311) @mroeschke
- Rename `.devcontainer`s for CUDA 12.5 (#16293) @jakirkham
- Fix split_record for all empty strings column (#16291) @davidwendt
- Fix logic in to_arrow for empty list column (#16279) @wence-
- [BUG] Make name attr of Index fast slow attrs (#16270) @Matt711
- Add custom name setter and getter for proxy objects in `cudf.pandas` (#16234) @Matt711
- Fall back when casting a timestamp to numeric in cudf-polars (#16232) @brandon-b-miller
- Disable large string support for Java build (#16216) @jlowe
- Remove CCCL patch for PR 211. (#16207) @bdice
- Add single offset to an empty ListArray in cudf::to_arrow (#16201) @davidwendt
- Fix `memory_usage` when calculating nested list column (#16193) @mroeschke
- Support at/iat indexers in cudf.pandas (#16177) @mroeschke
- Fix unused-return-value debug build error in from_arrow_stream_test.cpp (#16168) @davidwendt
- Fix cudf::strings::replace_multiple hang on empty target (#16167) @davidwendt
- Refactor from_arrow_device/host to use resource_ref (#16160) @harrism
- interpolate returns new column if no values are interpolated (#16158) @mroeschke
- Use provided memory resource for allocating mixed join results. (#16153) @bdice
- Run DFG after verify-alpha-spec (#16151) @KyleFromNVIDIA
- Use size_t to allow large conditional joins (#16127) @bdice
- Allow only scale=0 fixed-point values in fixed_width_column_wrapper (#16120) @davidwendt
- Fix pylibcudf Table.num_rows for 0 columns case and add interop to docs (#16108) @lithomas1
- Add support for proxy `np.flatiter` objects (#16107) @Matt711
- Ensure cudf objects can astype to any type when empty (#16106) @mroeschke
- Support `pd.read_pickle` and `pd.to_pickle` in `cudf.pandas` (#16105) @Matt711
- Fix unnecessarily strict check in parquet chunked reader for choosing split locations. (#16099) @nvdbaranec
- Fix `is_monotonic_*` APIs to include `nan's` (#16085) @galipremsagar
- More safely parse CUDA versions when subprocess output is contaminated (#16067) @brandon-b-miller
- fast_slow_proxy: Don't import assert_eq at top-level (#16063) @wence-
- Prevent bad ColumnAccessor state after .sort_index(axis=1, ignore_index=True) (#16061) @mroeschke
- Fix ArrowDeviceArray interface to pass address of event (#16058) @zeroshade
- Fix a size overflow bug in hash groupby (#16053) @PointKernel
- Fix `atomic_ref` scope when multiple blocks are updating the same output (#16051) @vuule
- Fix initialization error in to_arrow for empty string views (#16033) @wence-
- Fix the int32 overflow when computing page fragment sizes for large string columns (#16028) @mhaseeb123
- Fix the pool size alignment issue (#16024) @PointKernel
- Improve multibyte-split byte-range performance (#16019) @davidwendt
- Fix target counting in strings char-parallel replace (#16017) @davidwendt
- Support IntervalDtype in cudf.from_pandas (#16014) @mroeschke
- Fix memory size in create_byte_range_infos_consecutive (#16012) @davidwendt
- Hide visibility of non public symbols (#15982) @robertmaynard
- Fix Cython typo preventing proper inheritance (#15978) @vyasr
- Fix convert_dtypes with convert_integer=False/convert_floating=True (#15964) @mroeschke
- Fix nunique for `MultiIndex`, `DataFrame`, and all NA case with `dropna=False` (#15962) @mroeschke
- Explicitly build for all GPU architectures (#15959) @vyasr
- Preserve column type and class information in more DataFrame operations (#15949) @mroeschke
- Add `__array_interface__` to cudf.pandas numpy.ndarray proxy (#15936) @mroeschke
- Allow tests to be built when stream util is disabled (#15933) @robertmaynard
- Fix JSON multi-source reading when total source size exceeds `INT_MAX` bytes (#15930) @shrshi
- Fix `dask_cudf.read_parquet` regression for legacy timestamp data (#15929) @rjzamora
- Fix offsetalator when accessing over 268 million rows (#15921) @davidwendt
- Fix debug assert in rowgroup_char_counts_kernel (#15902) @davidwendt
- Fix categorical conversion from chunked arrow arrays (#15886) @vyasr
- Handling for `NaN` and `inf` when converting floating point to fixed point types (#15885) @ttnghia
- Manual merge of Branch 24.08 from 24.06 (#15869) @galipremsagar
- Avoid unnecessary `Index` cast in `IndexedFrame.index` setter (#15843) @charlesbluca
- Fix large strings handling in nvtext::character_tokenize (#15829) @davidwendt
- Fix multi-replace target count logic for large strings (#15807) @davidwendt
- Fix JSON parsing memory corruption - Fix Mixed types nested children removal (#15798) @karthikeyann
- Allow anonymous user in devcontainer name. (#15784) @bdice
- Add support for additional metaclasses of proxies and use for ExcelWriter (#15399) @vyasr
📖 Documentation
- Add docstring for from_dataframe (#16260) @mroeschke
- Update libcudf compiler requirements in contributing doc (#16103) @davidwendt
- Add libcudf public/detail API pattern to developer guide (#16086) @davidwendt
- Explain line profiler and how to know which functions are GPU-accelerated. (#16079) @bdice
- cudf.pandas documentation improvement (#15948) @Matt711
- Reland "Fix docs for IO readers and strings_convert" (#15872)" (#15941) @lithomas1
- Document how to use cudf.pandas in tandem with multiprocessing (#15940) @wence-
- DOC: Add documentation for cudf.pandas in the Developer Guide (#15889) @Matt711
- Improve options docs (#15888) @bdice
- DOC: add linkcode to docs (#15860) @raybellwaves
- DOC: use intersphinx mapping in pandas-compat ext (#15846) @raybellwaves
- Fix inconsistent usage of 'results' and 'records' in read-json.md (#15766) @dagardner-nv
- Update PandasCompat.py to resolve references (#15704) @raybellwaves
🚀 New Features
- Warn on cuDF failure when `POLARS_VERBOSE` is true (#16308) @brandon-b-miller
- Add `drop_nulls` in `cudf-polars` (#16290) @brandon-b-miller
- [JNI] Add setKernelPinnedCopyThreshold and setPinnedAllocationThreshold (#16288) @abellina
- Implement support for scan_ndjson in cudf-polars (#16263) @lithomas1
- Publish cudf-polars nightlies (#16213) @lithomas1
- Modify `make_host_vector` and `make_device_uvector` factories to optionally use pinned memory and kernel copy (#16206) @vuule
- Migrate lists/set_operations to pylibcudf (#16190) @Matt711
- Migrate lists/filling to pylibcudf (#16189) @Matt711
- Fall back to CPU for unsupported libcudf binaryops in cudf-polars (#16188) @brandon-b-miller
- Use resource_ref for upstream in stream_checking_resource_adaptor (#16187) @harrism
- Migrate lists/modifying to pylibcudf (#16185) @Matt711
- Migrate lists/filtering to pylibcudf (#16184) @Matt711
- Migrate lists/sorting to pylibcudf (#16179) @Matt711
- Add missing methods to lists/list_column_view.pxd in pylibcudf (#16175) @Matt711
- Migrate pylibcudf lists gathering (#16170) @Matt711
- Move kernel vis over to CUDF_HIDDEN (#16165) @robertmaynard
- Add groupby_max multi-threaded benchmark (#16154) @srinivasyadav18
- Promote has_nested_columns to cudf public API (#16131) @robertmaynard
- Promote IO support queries to cudf API (#16125) @robertmaynard
- cudf::merge public API now support passing a user stream (#16124) @robertmaynard
- Add TPC-H inspired examples for Libcudf (#16088) @JayjeetAtGithub
- Installed cudf header use cudf::allocate_like (#16087) @robertmaynard
- `cudf-polars` string slicing (#16082) @brandon-b-miller
- Migrate Parquet reader to pylibcudf (#16078) @lithomas1
- Migrate lists/count_elements to pylibcudf (#16072) @Matt711
- Migrate lists/extract to pylibcudf (#16071) @Matt711
- Move common string utilities to public api (#16070) @robertmaynard
- stable_distinct public api now has a stream parameter (#16068) @robertmaynard
- Migrate expressions to pylibcudf (#16056) @lithomas1
- Add support to ArrowDataSource in SourceInfo (#16050) @lithomas1
- Experimental support for configurable prefetching (#16020) @vyasr
- Migrate CSV reader to pylibcudf (#16011) @lithomas1
- Migrate string `slice` APIs to `pylibcudf` (#15988) @brandon-b-miller
- Migrate lists/contains to pylibcudf (#15981) @Matt711
- Remove CCCL 2.2 patches as we now always use 2.5+ (#15969) @robertmaynard
- Migrate JSON reader to pylibcudf (#15966) @lithomas1
- Add a developer check for proxy objects (#15956) @Matt711
- Start migrating I/O writers to pylibcudf (starting with JSON) (#15952) @lithomas1
- Kernel copy for pinned memory (#15934) @vuule
- Migrate left join and conditional join benchmarks to use nvbench (#15931) @srinivasyadav18
- Migrate lists/combine to pylibcudf (#15928) @Matt711
- Plumb pylibcudf strings `contains_re` through cudf_polars (#15918) @brandon-b-miller
- Start migrating I/O to pylibcudf (#15899) @lithomas1
- Pinned vector factory that uses the global pool (#15895) @vuule
- Migrate strings `contains` operations to `pylibcudf` (#15880) @brandon-b-miller
- Migrate quantile.pxd to pylibcudf (#15874) @lithomas1
- Migrate round to pylibcudf (#15863) @lithomas1
- Migrate string replace.pxd to pylibcudf (#15839) @lithomas1
- Add an Environment Variable for debugging the fast path in cudf.pandas (#15837) @Matt711
- Add an option to run cuIO benchmarks with pinned buffers as input (#15830) @vuule
- Update `pylibcudf` testing utilities (#15772) @brandon-b-miller
- Migrate string `capitalize` APIs to `pylibcudf` (#15503) @brandon-b-miller
- Add tests for `pylibcudf` binaryops (#15470) @brandon-b-miller
- Migrate column factories to pylibcudf (#15257) @brandon-b-miller
- cuDF/libcudf exponentially weighted moving averages (#9027) @brandon-b-miller
🛠️ Improvements
- Ensure objects with interface are converted to cupy/numpy arrays (#16436) @mroeschke
- Add about rmm modes in `cudf.pandas` docs (#16404) @galipremsagar
- Gracefully CUDF_FAIL when `skip_rows > 0` in Chunked Parquet reader (#16385) @mhaseeb123
- Make C++ compilation warning free after #16297 (#16379) @wence-
- Align Index init APIs with pandas 2.x (#16362) @mroeschke
- Use rapids_cpm_bs_thread_pool() (#16360) @KyleFromNVIDIA
- Rename PrefetchConfig to prefetch_config. (#16358) @bdice
- Implement parquet reading using pylibcudf in cudf-polars (#16346) @lithomas1
- Fix compile warnings with `jni_utils.hpp` (#16336) @ttnghia
- Align Series APIs with pandas 2.x (#16333) @mroeschke
- Add missing `stream` param to dictionary factory APIs (#16319) @JayjeetAtGithub
- Mark cudf._typing as a typing module in ruff (#16318) @mroeschke
- Add `stream` param to list explode APIs (#16317) @JayjeetAtGithub
- Fix polars for 1.2.1 (#16316) @lithomas1
- Use workflow branch 24.08 again (#16314) @KyleFromNVIDIA
- Deprecate dtype= parameter in reduction methods (#16313) @mroeschke
- Remove squeeze argument from groupby (#16312) @mroeschke
- Align more DataFrame APIs with pandas (#16310) @mroeschke
- Clean unneeded/redundant dtype utils (#16309) @mroeschke
- Implement read_csv in cudf-polars using pylibcudf (#16307) @lithomas1
- Use Column.can_cast_safely instead of some ad-hoc dtype functions in .where (#16303) @mroeschke
- Drop `{{ pin_compatible('numpy', max_pin='x') }}` (#16301) @jakirkham
- Host implementation of `to_arrow` using nanoarrow (#16297) @zeroshade
- Add ability to prefetch in `cudf.pandas` and change default to managed pool (#16296) @galipremsagar
- Fix tests for polars 1.2 (#16292) @lithomas1
- Introduce dedicated options for low memory readers (#16289) @galipremsagar
- Remove decimal/floating 64/128bit switches due to register pressure (#16287) @pmattione-nvidia
- Make ColumnAccessor strictly require a mapping of columns (#16285) @mroeschke
- Introduce version file so we can conditionally handle things in tests (#16280) @wence-
- Type & reduce cupy usage (#16277) @mroeschke
- Update cudf::detail::grid_1d to use thread_index_type (#16276) @davidwendt
- Replace np.isscalar/issubdtype checks with is_scalar/.kind checks (#16275) @mroeschke
- Remove xml from sort_ninja_log.py utility (#16274) @davidwendt
- Fix issue in horizontal concat implementation in cudf-polars (#16271) @wence-
- Preserve order in left join for cudf-polars (#16268) @wence-
- Replace is_datetime/timedelta_dtype checks with .kind checks (#16262) @mroeschke
- Replace is_float/integer_dtype checks with .kind checks (#16261) @mroeschke
- Build and test with CUDA 12.5.1 (#16259) @KyleFromNVIDIA
- Replace is_bool_type with checking .dtype.kind (#16255) @mroeschke
- remove `cuco_noexcept.diff` (#16254) @trxcllnt
- Update contains_tests.cpp to use public cudf::slice (#16253) @davidwendt
- Improve the test data for pylibcudf I/O tests (#16247) @lithomas1
- Short circuit some Column methods (#16246) @mroeschke
- Make nvcomp adapter compatible with new version macros (#16245) @vuule
- Add Column.strftime/strptime instead of overloading `as_string/datetime/timedelta_column` (#16243) @mroeschke
- Remove temporary functor overloads required by cuco version bump (#16242) @PointKernel
- Remove hash_character_ngrams dependency from jaccard_index (#16241) @davidwendt
- Expose sorted groupby parameters to pylibcudf (#16240) @wence-
- Expose reflection to check if casting between two types is supported (#16239) @wence-
- Handle nans in groupby-aggregations in polars executor (#16233) @wence-
- Remove `mr` param from `write_csv` and `write_json` (#16231) @JayjeetAtGithub
- Support Literals in groupby-agg (#16218) @wence-
- Handler csv reader options in cudf-polars (#16211) @wence-
- Update vendored thread_pool implementation (#16210) @wence-
- Add low memory JSON reader for `cudf.pandas` (#16204) @galipremsagar
- Clean up state variables in MultiIndex (#16203) @mroeschke
- skip CMake 3.30.0 (#16202) @jameslamb
- Assert valid metadata is passed in to_arrow for list_view (#16198) @wence-
- Expose type traits to pylibcudf (#16197) @wence-
- Report number of rows per file read by PQ reader when no row selection and fix segfault in chunked PQ reader when skip_rows > 0 (#16195) @mhaseeb123
- Cast count aggs to correct dtype in translation (#16192) @wence-
- Some small fixes in cudf-polars (#16191) @wence-
- split up CUDA-suffixed dependencies in dependencies.yaml (#16183) @jameslamb
- Define PTDS for the stream hook libs (#16182) @trxcllnt
- Make `test_python_cudf_pandas` generate `requirements.txt` (#16181) @trxcllnt
- Add environment-agnostic `ci/run_cudf_polars_pytest.sh` (#16178) @trxcllnt
- Implement translation for some unary functions and a single datetime extraction (#16173) @wence-
- Remove size constraints on source files in batched JSON reading (#16162) @shrshi
- CI: Build wheels for cudf-polars (#16156) @lithomas1
- Update cudf-polars for v1 release of polars (#16149) @wence-
- Use strings concatenate to support large strings in CSV writer (#16148) @davidwendt
- Use verify-alpha-spec hook (#16144) @KyleFromNVIDIA
- Adds write-coalescing code path optimization to FST (#16143) @elstehle
- MAINT: Adapt to NumPy 2 promotion changes (#16141) @seberg
- API: Check for integer overflows when creating scalar from python int (#16140) @seberg
- Remove the (unused) implementation of `host_parse_nested_json` (#16135) @vuule
- Deprecate Arrow support in I/O (#16132) @lithomas1
- Disable dict support for split-page kernel in the parquet reader. (#16128) @nvdbaranec
- Add throughput metrics for REDUCTION_BENCH/REDUCTION_NVBENCH benchmarks (#16126) @jihoonson
- Add ensure_index to not unnecessarily shallow copy cudf.Index (#16117) @mroeschke
- Make binary operators work between fixed-point and floating args (#16116) @pmattione-nvidia
- Implement Ternary copy_if_else (#16114) @wence-
- Implement handlers for series literal in cudf-polars (#16113) @wence-
- Fix dtype errors in `StringArrays` (#16111) @galipremsagar
- Ensure MultiIndex.to_frame deep copies columns (#16110) @mroeschke
- Parallelize `gpuInitStringDescriptors` for fixed length byte array data (#16109) @mhaseeb123
- Finish implementation of cudf-polars boolean function handlers (#16098) @wence-
- Expose and then implement support for cross joins in cudf-polars (#16097) @wence-
- Defer copying in Column.astype(copy=True) (#16095) @mroeschke
- Fix segfault in conditional join (#16094) @bdice
- Free temp memory no longer needed in multibyte_split processing (#16091) @davidwendt
- Rename gather/scatter benchmarks to clarify coalesced behavior. (#16083) @bdice
- Adapt to polars upstream changes and turn on CI testing (#16081) @wence-
- Reduce/clean copy usage in Series, reshaping (#16080) @mroeschke
- Account for FIXED_LEN_BYTE_ARRAY when calculating fragment sizes in Parquet writer (#16064) @etseidl
- Reduce (shallow) copies in DataFrame ops (#16060) @mroeschke
- Add multi-file support to `dask_cudf.read_json` (#16057) @rjzamora
- Reduce deep copies in Index ops (#16054) @mroeschke
- Implement chunked column wise concat in chunked parquet reader (#16052) @galipremsagar
- Add exception when trying to create large strings with cudf::test::strings_column_wrapper (#16049) @davidwendt
- Return `FrozenList` for `Index.names` (#16047) @galipremsagar
- Add ast cast test (#16045) @pmattione-nvidia
- Remove `override_dtypes` and `include_index` from `Frame._copy_type_metadata` (#16043) @mroeschke
- Add ruff rules to avoid importing from typing (#16040) @mroeschke
- Fix decimal -> float cast in ast code (#16038) @pmattione-nvidia
- Add compile option to enable large strings support (#16037) @davidwendt
- Reduce conditional_join nvbench configurations (#16036) @srinivasyadav18
- Project automation update: skip if not in project (#16035) @jarmak-nv
- Add stream parameter to cudf::io::text::multibyte_split (#16034) @davidwendt
- Delete unused code from stringfunction evaluator (#16032) @wence-
- Fix exclude regex in pre-commit clang-format hook (#16030) @wence-
- Refactor rmm usage in `cudf.pandas` (#16021) @galipremsagar
- Enable ruff TCH: typing imports under if TYPE_CHECKING (#16015) @mroeschke
- Restrict the allowed pandas timezone objects in cudf (#16013) @mroeschke
- orc multithreaded benchmark (#16009) @zpuller
- Add tests of expression-based sort and sort-by (#16008) @wence-
- Add tests of implemented StringFunctions (#16007) @wence-
- Add test that diagonal concat with mismatching schemas raises (#16006) @wence-
- Add coverage selecting len from a dataframe (number of rows) (#16005) @wence-
- Add basic tests of dataframe scan (#16003) @wence-
- Add coverage for both expression and dataframe filter (#16002) @wence-
- Remove deprecated ExtContext node (#16001) @wence-
- Fix typo bug in gather implementation (#16000) @wence-
- Extend coverage of groupby and rolling window nodes (#15999) @wence-
- Coverage of binops where one or both operands are a scalar (#15998) @wence-
- Add full coverage for whole-frame Agg expressions (#15997) @wence-
- Add tests covering magic methods of Expr objects (#15996) @wence-
- Add full coverage of utility functions (#15995) @wence-
- Test behaviour of containers (#15994) @wence-
- Fix implementation of any, all, and is_between (#15993) @wence-
- Raise early on unhandled PythonScan node (#15992) @wence-
- Remove mapfunction nodes that don't exist/aren't supported (#15991) @wence-
- Add test coverage for slicing with "out of bounds" negative indices (#15990) @wence-
- Standardize and type `Series.dt` methods (#15987) @mroeschke
- Refactor distinct with hashset-based algorithms (#15984) @srinivasyadav18
- resolve dependency-file-generator warning, remove unnecessary rapids-build-backend configuration (#15980) @jameslamb
- Project automation bug fixes (#15971) @jarmak-nv
- Add typing to single_column_frame (#15965) @mroeschke
- Move some misc Frame methods to appropriate locations (#15963) @mroeschke
- Condense pylibcudf data fixtures (#15958) @lithomas1
- Refactor fillna logic to push specifics toward Frame subclasses and Column subclasses (#15957) @mroeschke
- Remove unused parsing utilities (#15955) @vuule
- Remove `Scalar` container type from polars interpreter (#15953) @wence-
- Support arbitrary CUDA versions in UDF code (#15950) @bdice
- Support large strings in cudf::io::text::multibyte_split (#15947) @davidwendt
- Add external issue label and project automation (#15945) @jarmak-nv
- Enable round-tripping of large strings in `cudf` (#15944) @galipremsagar
- Add more complete type annotations in polars interpreter (#15942) @wence-
- Update implementations to build with the latest cuco (#15938) @PointKernel
- Support timezone aware pandas inputs in cudf (#15935) @mroeschke
- Define Column.nan_as_null to return self (#15923) @mroeschke
- Make Frame._dtype an iterator instead of a dict (#15920) @mroeschke
- Port start of datetime.hpp to pylibcudf (#15916) @wence-
- Introduce `NamedColumn` concept in cudf-polars (#15914) @wence-
- Avoid redefining Frame._get_columns_by_label in subclasses (#15912) @mroeschke
- Templatization of fixed-width parquet decoding kernels. (#15911) @nvdbaranec
- New Decimal <--> Floating conversion (#15905) @pmattione-nvidia
- Use Arrow C Data Interface functions for Python interop (#15904) @vyasr
- Use offsetalator in cudf::io::json::detail::parse_string (#15900) @davidwendt
- Rename strings multiple target replace API (#15898) @davidwendt
- Apply clang-tidy autofixes (#15894) @vyasr
- Update Python labels and remove unnecessary ones (#15893) @vyasr
- Clean up pylibcudf test assertions (#15892) @lithomas1
- Use offsetalator in orc rowgroup_char_counts_kernel (#15891) @davidwendt
- Ensure literals have correct dtype (#15890) @wence-
- Add overflow check when converting large strings to lists columns (#15887) @davidwendt
- Use offsetalator in nvtext::tokenize_with_vocabulary (#15878) @davidwendt
- Update interleave lists column for large strings (#15877) @davidwendt
- Simple NumPy 2 fixes that are clearly no behavior change (#15876) @seberg
- Support `arrow:schema` in Parquet writer to faithfully roundtrip `duration` types with Arrow (#15875) @mhaseeb123
- Refactor join benchmarks to target public APIs with the default stream (#15873) @PointKernel
- Fix url-decode benchmark to use offsetalator (#15871) @davidwendt
- Use offsetalator in strings shift functor (#15870) @davidwendt
- Memory Profiling (#15866) @madsbk
- Expose stream parameter to public rolling APIs (#15865) @srinivasyadav18
- Make Frame.astype return Self instead of a ColumnAccessor (#15861) @mroeschke
- Use ColumnAccessor row and column length attributes more consistently (#15857) @mroeschke
- add unit test setup for cudf_kafka (#15853) @jameslamb
- Remove internal usage of core.index.as_index in favor of cudf.Index (#15851) @mroeschke
- Ensure cudf.Series(cudf.Series(...)) creates a reference to the same index (#15845) @mroeschke
- Remove benchmark-specific use of pinned-pooled memory in Parquet multithreaded benchmark. (#15838) @nvdbaranec
- Implement `on_bad_lines` in json reader (#15834) @galipremsagar
- Make Column.to_pandas return Index instead of Series (#15833) @mroeschke
- Add test of interoperability of cuDF and arrow BYTE_STREAM_SPLIT encoders (#15832) @etseidl
- Refactor Parquet writer options and builders (#15831) @etseidl
- Migrate reshape.pxd to pylibcudf (#15827) @lithomas1
- Remove legacy JSON reader and concurrent_unordered_map.cuh. (#15813) @bdice
- Switch cuIO benchmarks to use pinned-pool host allocations by default. (#15805) @nvdbaranec
- Change thrust::count_if call to raw kernel in strings split APIs (#15762) @davidwendt
- Improve performance for long strings for nvtext::replace_tokens (#15756) @davidwendt
- Implement chunked parquet reader in cudf-python (#15728) @galipremsagar
- Add `from_arrow_host` functions for cudf interop with nanoarrow (#15645) @zeroshade
- Add ability to enable rmm pool on `cudf.pandas` import (#15628) @galipremsagar
- Executor for polars logical plans (#15504) @wence-
- Implement dayname and monthname to match pandas (#15479) @btepera
- Utilities for decimal <--> floating conversion (#15359) @pmattione-nvidia
- For powers of 10, replace ipow with switch (#15353) @pmattione-nvidia
- Use rapids-build-backend. (#15245) @vyasr
- Add `codecov` coverage for `pandas_tests` (#14513) @galipremsagar
- C++
Published by raydouglass over 1 year ago
https://github.com/rapidsai/cudf - v24.08.00
🚨 Breaking Changes
- Align Index init APIs with pandas 2.x (#16362) @mroeschke
- Align Series APIs with pandas 2.x (#16333) @mroeschke
- Add missing `stream` param to dictionary factory APIs (#16319) @JayjeetAtGithub
- Deprecate dtype= parameter in reduction methods (#16313) @mroeschke
- Remove squeeze argument from groupby (#16312) @mroeschke
- Align more DataFrame APIs with pandas (#16310) @mroeschke
- Remove `mr` param from `write_csv` and `write_json` (#16231) @JayjeetAtGithub
- Report number of rows per file read by PQ reader when no row selection and fix segfault in chunked PQ reader when skip_rows > 0 (#16195) @mhaseeb123
- Refactor from_arrow_device/host to use resource_ref (#16160) @harrism
- Deprecate Arrow support in I/O (#16132) @lithomas1
- Return `FrozenList` for `Index.names` (#16047) @galipremsagar
- Add compile option to enable large strings support (#16037) @davidwendt
- Hide visibility of non public symbols (#15982) @robertmaynard
- Rename strings multiple target replace API (#15898) @davidwendt
- Pinned vector factory that uses the global pool (#15895) @vuule
- Apply clang-tidy autofixes (#15894) @vyasr
- Support `arrow:schema` in Parquet writer to faithfully roundtrip `duration` types with Arrow (#15875) @mhaseeb123
- Expose stream parameter to public rolling APIs (#15865) @srinivasyadav18
- Fix large strings handling in nvtext::character_tokenize (#15829) @davidwendt
- Remove legacy JSON reader and concurrent_unordered_map.cuh. (#15813) @bdice
🐛 Bug Fixes
- Add `flatbuffers` to `libcudf` build (#16446) @galipremsagar
- Fix parquet_field_list read_func lambda capture invalid this pointer (#16440) @davidwendt
- Enable prefetching in cudf.pandas.install() (#16439) @bdice
- Enable prefetching before `runpy` (#16427) @galipremsagar
- Support thread-safe for `prefetch_config::get` and `prefetch_config::set` (#16425) @ttnghia
- Fix a `pandas-2.0` missing attribute error (#16416) @galipremsagar
- [Bug] Remove loud `NativeFile` deprecation noise for `read_parquet` from S3 (#16415) @rjzamora
- Fix nightly memcheck error for empty STREAM_INTEROP_TEST (#16406) @davidwendt
- Gate ArrowStringArrayNumpySemantics cudf.pandas proxy behind version check (#16401) @mroeschke
- Don't export bs_thread_pool (#16398) @KyleFromNVIDIA
- Require fixed width types for casting in `cudf-polars` (#16381) @brandon-b-miller
- Fix docstring of `DataFrame.apply` (#16351) @galipremsagar
- Make __bool__ raise for more cudf objects (#16311) @mroeschke
- Rename `.devcontainer`s for CUDA 12.5 (#16293) @jakirkham
- Fix split_record for all empty strings column (#16291) @davidwendt
- Fix logic in to_arrow for empty list column (#16279) @wence-
- [BUG] Make name attr of Index fast slow attrs (#16270) @Matt711
- Add custom name setter and getter for proxy objects in `cudf.pandas` (#16234) @Matt711
- Fall back when casting a timestamp to numeric in cudf-polars (#16232) @brandon-b-miller
- Disable large string support for Java build (#16216) @jlowe
- Remove CCCL patch for PR 211. (#16207) @bdice
- Add single offset to an empty ListArray in cudf::to_arrow (#16201) @davidwendt
- Fix `memory_usage` when calculating nested list column (#16193) @mroeschke
- Support at/iat indexers in cudf.pandas (#16177) @mroeschke
- Fix unused-return-value debug build error in from_arrow_stream_test.cpp (#16168) @davidwendt
- Fix cudf::strings::replace_multiple hang on empty target (#16167) @davidwendt
- Refactor from_arrow_device/host to use resource_ref (#16160) @harrism
- interpolate returns new column if no values are interpolated (#16158) @mroeschke
- Use provided memory resource for allocating mixed join results. (#16153) @bdice
- Run DFG after verify-alpha-spec (#16151) @KyleFromNVIDIA
- Use size_t to allow large conditional joins (#16127) @bdice
- Allow only scale=0 fixed-point values in fixed_width_column_wrapper (#16120) @davidwendt
- Fix pylibcudf Table.num_rows for 0 columns case and add interop to docs (#16108) @lithomas1
- Add support for proxy `np.flatiter` objects (#16107) @Matt711
- Ensure cudf objects can astype to any type when empty (#16106) @mroeschke
- Support `pd.read_pickle` and `pd.to_pickle` in `cudf.pandas` (#16105) @Matt711
- Fix unnecessarily strict check in parquet chunked reader for choosing split locations. (#16099) @nvdbaranec
- Fix `is_monotonic_*` APIs to include `nan's` (#16085) @galipremsagar
- More safely parse CUDA versions when subprocess output is contaminated (#16067) @brandon-b-miller
- fast_slow_proxy: Don't import assert_eq at top-level (#16063) @wence-
- Prevent bad ColumnAccessor state after .sort_index(axis=1, ignore_index=True) (#16061) @mroeschke
- Fix ArrowDeviceArray interface to pass address of event (#16058) @zeroshade
- Fix a size overflow bug in hash groupby (#16053) @PointKernel
- Fix `atomic_ref` scope when multiple blocks are updating the same output (#16051) @vuule
- Fix initialization error in to_arrow for empty string views (#16033) @wence-
- Fix the int32 overflow when computing page fragment sizes for large string columns (#16028) @mhaseeb123
- Fix the pool size alignment issue (#16024) @PointKernel
- Improve multibyte-split byte-range performance (#16019) @davidwendt
- Fix target counting in strings char-parallel replace (#16017) @davidwendt
- Support IntervalDtype in cudf.from_pandas (#16014) @mroeschke
- Fix memory size in create_byte_range_infos_consecutive (#16012) @davidwendt
- Hide visibility of non public symbols (#15982) @robertmaynard
- Fix Cython typo preventing proper inheritance (#15978) @vyasr
- Fix convert_dtypes with convert_integer=False/convert_floating=True (#15964) @mroeschke
- Fix nunique for `MultiIndex`, `DataFrame`, and all NA case with `dropna=False` (#15962) @mroeschke
- Explicitly build for all GPU architectures (#15959) @vyasr
- Preserve column type and class information in more DataFrame operations (#15949) @mroeschke
- Add __array_interface__ to cudf.pandas numpy.ndarray proxy (#15936) @mroeschke
- Allow tests to be built when stream util is disabled (#15933) @robertmaynard
- Fix JSON multi-source reading when total source size exceeds `INT_MAX` bytes (#15930) @shrshi
- Fix `dask_cudf.read_parquet` regression for legacy timestamp data (#15929) @rjzamora
- Fix offsetalator when accessing over 268 million rows (#15921) @davidwendt
- Fix debug assert in rowgroup_char_counts_kernel (#15902) @davidwendt
- Fix categorical conversion from chunked arrow arrays (#15886) @vyasr
- Handling for `NaN` and `inf` when converting floating point to fixed point types (#15885) @ttnghia
- Manual merge of Branch 24.08 from 24.06 (#15869) @galipremsagar
- Avoid unnecessary `Index` cast in `IndexedFrame.index` setter (#15843) @charlesbluca
- Fix large strings handling in nvtext::character_tokenize (#15829) @davidwendt
- Fix multi-replace target count logic for large strings (#15807) @davidwendt
- Fix JSON parsing memory corruption - Fix Mixed types nested children removal (#15798) @karthikeyann
- Allow anonymous user in devcontainer name. (#15784) @bdice
- Add support for additional metaclasses of proxies and use for ExcelWriter (#15399) @vyasr
📖 Documentation
- Add docstring for from_dataframe (#16260) @mroeschke
- Update libcudf compiler requirements in contributing doc (#16103) @davidwendt
- Add libcudf public/detail API pattern to developer guide (#16086) @davidwendt
- Explain line profiler and how to know which functions are GPU-accelerated. (#16079) @bdice
- cudf.pandas documentation improvement (#15948) @Matt711
- Reland "Fix docs for IO readers and strings_convert" (#15872)" (#15941) @lithomas1
- Document how to use cudf.pandas in tandem with multiprocessing (#15940) @wence-
- DOC: Add documentation for cudf.pandas in the Developer Guide (#15889) @Matt711
- Improve options docs (#15888) @bdice
- DOC: add linkcode to docs (#15860) @raybellwaves
- DOC: use intersphinx mapping in pandas-compat ext (#15846) @raybellwaves
- Fix inconsistent usage of 'results' and 'records' in read-json.md (#15766) @dagardner-nv
- Update PandasCompat.py to resolve references (#15704) @raybellwaves
🚀 New Features
- Warn on cuDF failure when `POLARS_VERBOSE` is true (#16308) @brandon-b-miller
- Add `drop_nulls` in `cudf-polars` (#16290) @brandon-b-miller
- [JNI] Add setKernelPinnedCopyThreshold and setPinnedAllocationThreshold (#16288) @abellina
- Implement support for scan_ndjson in cudf-polars (#16263) @lithomas1
- Publish cudf-polars nightlies (#16213) @lithomas1
- Modify `make_host_vector` and `make_device_uvector` factories to optionally use pinned memory and kernel copy (#16206) @vuule
- Migrate lists/set_operations to pylibcudf (#16190) @Matt711
- Migrate lists/filling to pylibcudf (#16189) @Matt711
- Fall back to CPU for unsupported libcudf binaryops in cudf-polars (#16188) @brandon-b-miller
- Use resource_ref for upstream in stream_checking_resource_adaptor (#16187) @harrism
- Migrate lists/modifying to pylibcudf (#16185) @Matt711
- Migrate lists/filtering to pylibcudf (#16184) @Matt711
- Migrate lists/sorting to pylibcudf (#16179) @Matt711
- Add missing methods to lists/list_column_view.pxd in pylibcudf (#16175) @Matt711
- Migrate pylibcudf lists gathering (#16170) @Matt711
- Move kernel vis over to CUDF_HIDDEN (#16165) @robertmaynard
- Add groupby_max multi-threaded benchmark (#16154) @srinivasyadav18
- Promote has_nested_columns to cudf public API (#16131) @robertmaynard
- Promote IO support queries to cudf API (#16125) @robertmaynard
- cudf::merge public API now support passing a user stream (#16124) @robertmaynard
- Add TPC-H inspired examples for Libcudf (#16088) @JayjeetAtGithub
- Installed cudf header use cudf::allocate_like (#16087) @robertmaynard
- `cudf-polars` string slicing (#16082) @brandon-b-miller
- Migrate Parquet reader to pylibcudf (#16078) @lithomas1
- Migrate lists/count_elements to pylibcudf (#16072) @Matt711
- Migrate lists/extract to pylibcudf (#16071) @Matt711
- Move common string utilities to public api (#16070) @robertmaynard
- stable_distinct public api now has a stream parameter (#16068) @robertmaynard
- Migrate expressions to pylibcudf (#16056) @lithomas1
- Add support to ArrowDataSource in SourceInfo (#16050) @lithomas1
- Experimental support for configurable prefetching (#16020) @vyasr
- Migrate CSV reader to pylibcudf (#16011) @lithomas1
- Migrate string `slice` APIs to `pylibcudf` (#15988) @brandon-b-miller
- Migrate lists/contains to pylibcudf (#15981) @Matt711
- Remove CCCL 2.2 patches as we now always use 2.5+ (#15969) @robertmaynard
- Migrate JSON reader to pylibcudf (#15966) @lithomas1
- Add a developer check for proxy objects (#15956) @Matt711
- Start migrating I/O writers to pylibcudf (starting with JSON) (#15952) @lithomas1
- Kernel copy for pinned memory (#15934) @vuule
- Migrate left join and conditional join benchmarks to use nvbench (#15931) @srinivasyadav18
- Migrate lists/combine to pylibcudf (#15928) @Matt711
- Plumb pylibcudf strings `contains_re` through cudf_polars (#15918) @brandon-b-miller
- Start migrating I/O to pylibcudf (#15899) @lithomas1
- Pinned vector factory that uses the global pool (#15895) @vuule
- Migrate strings `contains` operations to `pylibcudf` (#15880) @brandon-b-miller
- Migrate quantile.pxd to pylibcudf (#15874) @lithomas1
- Migrate round to pylibcudf (#15863) @lithomas1
- Migrate string replace.pxd to pylibcudf (#15839) @lithomas1
- Add an Environment Variable for debugging the fast path in cudf.pandas (#15837) @Matt711
- Add an option to run cuIO benchmarks with pinned buffers as input (#15830) @vuule
- Update `pylibcudf` testing utilities (#15772) @brandon-b-miller
- Migrate string `capitalize` APIs to `pylibcudf` (#15503) @brandon-b-miller
- Add tests for `pylibcudf` binaryops (#15470) @brandon-b-miller
- Migrate column factories to pylibcudf (#15257) @brandon-b-miller
- cuDF/libcudf exponentially weighted moving averages (#9027) @brandon-b-miller
🛠️ Improvements
- Ensure objects with __interface__ are converted to cupy/numpy arrays (#16436) @mroeschke
- Add about rmm modes in `cudf.pandas` docs (#16404) @galipremsagar
- Gracefully CUDF_FAIL when `skip_rows > 0` in Chunked Parquet reader (#16385) @mhaseeb123
- Make C++ compilation warning free after #16297 (#16379) @wence-
- Align Index init APIs with pandas 2.x (#16362) @mroeschke
- Use rapids_cpm_bs_thread_pool() (#16360) @KyleFromNVIDIA
- Rename PrefetchConfig to prefetch_config. (#16358) @bdice
- Implement parquet reading using pylibcudf in cudf-polars (#16346) @lithomas1
- Fix compile warnings with `jni_utils.hpp` (#16336) @ttnghia
- Align Series APIs with pandas 2.x (#16333) @mroeschke
- Add missing `stream` param to dictionary factory APIs (#16319) @JayjeetAtGithub
- Mark cudf._typing as a typing module in ruff (#16318) @mroeschke
- Add `stream` param to list explode APIs (#16317) @JayjeetAtGithub
- Fix polars for 1.2.1 (#16316) @lithomas1
- Use workflow branch 24.08 again (#16314) @KyleFromNVIDIA
- Deprecate dtype= parameter in reduction methods (#16313) @mroeschke
- Remove squeeze argument from groupby (#16312) @mroeschke
- Align more DataFrame APIs with pandas (#16310) @mroeschke
- Clean unneeded/redundant dtype utils (#16309) @mroeschke
- Implement read_csv in cudf-polars using pylibcudf (#16307) @lithomas1
- Use Column.can_cast_safely instead of some ad-hoc dtype functions in .where (#16303) @mroeschke
- Drop `{{ pin_compatible('numpy', max_pin='x') }}` (#16301) @jakirkham
- Host implementation of `to_arrow` using nanoarrow (#16297) @zeroshade
- Add ability to prefetch in `cudf.pandas` and change default to managed pool (#16296) @galipremsagar
- Fix tests for polars 1.2 (#16292) @lithomas1
- Introduce dedicated options for low memory readers (#16289) @galipremsagar
- Remove decimal/floating 64/128bit switches due to register pressure (#16287) @pmattione-nvidia
- Make ColumnAccessor strictly require a mapping of columns (#16285) @mroeschke
- Introduce version file so we can conditionally handle things in tests (#16280) @wence-
- Type & reduce cupy usage (#16277) @mroeschke
- Update cudf::detail::grid_1d to use thread_index_type (#16276) @davidwendt
- Replace np.isscalar/issubdtype checks with is_scalar/.kind checks (#16275) @mroeschke
- Remove xml from sort_ninja_log.py utility (#16274) @davidwendt
- Fix issue in horizontal concat implementation in cudf-polars (#16271) @wence-
- Preserve order in left join for cudf-polars (#16268) @wence-
- Replace is_datetime/timedelta_dtype checks with .kind checks (#16262) @mroeschke
- Replace is_float/integer_dtype checks with .kind checks (#16261) @mroeschke
- Build and test with CUDA 12.5.1 (#16259) @KyleFromNVIDIA
- Replace is_bool_type with checking .dtype.kind (#16255) @mroeschke
- remove `cuco_noexcept.diff` (#16254) @trxcllnt
- Update contains_tests.cpp to use public cudf::slice (#16253) @davidwendt
- Improve the test data for pylibcudf I/O tests (#16247) @lithomas1
- Short circuit some Column methods (#16246) @mroeschke
- Make nvcomp adapter compatible with new version macros (#16245) @vuule
- Add Column.strftime/strptime instead of overloading `as_string/datetime/timedelta_column` (#16243) @mroeschke
- Remove temporary functor overloads required by cuco version bump (#16242) @PointKernel
- Remove hash_character_ngrams dependency from jaccard_index (#16241) @davidwendt
- Expose sorted groupby parameters to pylibcudf (#16240) @wence-
- Expose reflection to check if casting between two types is supported (#16239) @wence-
- Handle nans in groupby-aggregations in polars executor (#16233) @wence-
- Remove `mr` param from `write_csv` and `write_json` (#16231) @JayjeetAtGithub
- Support Literals in groupby-agg (#16218) @wence-
- Handler csv reader options in cudf-polars (#16211) @wence-
- Update vendored thread_pool implementation (#16210) @wence-
- Add low memory JSON reader for `cudf.pandas` (#16204) @galipremsagar
- Clean up state variables in MultiIndex (#16203) @mroeschke
- skip CMake 3.30.0 (#16202) @jameslamb
- Assert valid metadata is passed in to_arrow for list_view (#16198) @wence-
- Expose type traits to pylibcudf (#16197) @wence-
- Report number of rows per file read by PQ reader when no row selection and fix segfault in chunked PQ reader when skip_rows > 0 (#16195) @mhaseeb123
- Cast count aggs to correct dtype in translation (#16192) @wence-
- Some small fixes in cudf-polars (#16191) @wence-
- split up CUDA-suffixed dependencies in dependencies.yaml (#16183) @jameslamb
- Define PTDS for the stream hook libs (#16182) @trxcllnt
- Make `test_python_cudf_pandas` generate `requirements.txt` (#16181) @trxcllnt
- Add environment-agnostic `ci/run_cudf_polars_pytest.sh` (#16178) @trxcllnt
- Implement translation for some unary functions and a single datetime extraction (#16173) @wence-
- Remove size constraints on source files in batched JSON reading (#16162) @shrshi
- CI: Build wheels for cudf-polars (#16156) @lithomas1
- Update cudf-polars for v1 release of polars (#16149) @wence-
- Use strings concatenate to support large strings in CSV writer (#16148) @davidwendt
- Use verify-alpha-spec hook (#16144) @KyleFromNVIDIA
- Adds write-coalescing code path optimization to FST (#16143) @elstehle
- MAINT: Adapt to NumPy 2 promotion changes (#16141) @seberg
- API: Check for integer overflows when creating scalar from python int (#16140) @seberg
- Remove the (unused) implementation of `host_parse_nested_json` (#16135) @vuule
- Deprecate Arrow support in I/O (#16132) @lithomas1
- Disable dict support for split-page kernel in the parquet reader. (#16128) @nvdbaranec
- Add throughput metrics for REDUCTION_BENCH/REDUCTION_NVBENCH benchmarks (#16126) @jihoonson
- Add ensure_index to not unnecessarily shallow copy cudf.Index (#16117) @mroeschke
- Make binary operators work between fixed-point and floating args (#16116) @pmattione-nvidia
- Implement Ternary copy_if_else (#16114) @wence-
- Implement handlers for series literal in cudf-polars (#16113) @wence-
- Fix dtype errors in `StringArrays` (#16111) @galipremsagar
- Ensure MultiIndex.to_frame deep copies columns (#16110) @mroeschke
- Parallelize `gpuInitStringDescriptors` for fixed length byte array data (#16109) @mhaseeb123
- Finish implementation of cudf-polars boolean function handlers (#16098) @wence-
- Expose and then implement support for cross joins in cudf-polars (#16097) @wence-
- Defer copying in Column.astype(copy=True) (#16095) @mroeschke
- Fix segfault in conditional join (#16094) @bdice
- Free temp memory no longer needed in multibyte_split processing (#16091) @davidwendt
- Rename gather/scatter benchmarks to clarify coalesced behavior. (#16083) @bdice
- Adapt to polars upstream changes and turn on CI testing (#16081) @wence-
- Reduce/clean copy usage in Series, reshaping (#16080) @mroeschke
- Account for FIXED_LEN_BYTE_ARRAY when calculating fragment sizes in Parquet writer (#16064) @etseidl
- Reduce (shallow) copies in DataFrame ops (#16060) @mroeschke
- Add multi-file support to `dask_cudf.read_json` (#16057) @rjzamora
- Reduce deep copies in Index ops (#16054) @mroeschke
- Implement chunked column wise concat in chunked parquet reader (#16052) @galipremsagar
- Add exception when trying to create large strings with cudf::test::strings_column_wrapper (#16049) @davidwendt
- Return `FrozenList` for `Index.names` (#16047) @galipremsagar
- Add ast cast test (#16045) @pmattione-nvidia
- Remove `override_dtypes` and `include_index` from `Frame._copy_type_metadata` (#16043) @mroeschke
- Add ruff rules to avoid importing from typing (#16040) @mroeschke
- Fix decimal -> float cast in ast code (#16038) @pmattione-nvidia
- Add compile option to enable large strings support (#16037) @davidwendt
- Reduce conditional_join nvbench configurations (#16036) @srinivasyadav18
- Project automation update: skip if not in project (#16035) @jarmak-nv
- Add stream parameter to cudf::io::text::multibyte_split (#16034) @davidwendt
- Delete unused code from stringfunction evaluator (#16032) @wence-
- Fix exclude regex in pre-commit clang-format hook (#16030) @wence-
- Refactor rmm usage in `cudf.pandas` (#16021) @galipremsagar
- Enable ruff TCH: typing imports under if TYPE_CHECKING (#16015) @mroeschke
- Restrict the allowed pandas timezone objects in cudf (#16013) @mroeschke
- orc multithreaded benchmark (#16009) @zpuller
- Add tests of expression-based sort and sort-by (#16008) @wence-
- Add tests of implemented StringFunctions (#16007) @wence-
- Add test that diagonal concat with mismatching schemas raises (#16006) @wence-
- Add coverage selecting len from a dataframe (number of rows) (#16005) @wence-
- Add basic tests of dataframe scan (#16003) @wence-
- Add coverage for both expression and dataframe filter (#16002) @wence-
- Remove deprecated ExtContext node (#16001) @wence-
- Fix typo bug in gather implementation (#16000) @wence-
- Extend coverage of groupby and rolling window nodes (#15999) @wence-
- Coverage of binops where one or both operands are a scalar (#15998) @wence-
- Add full coverage for whole-frame Agg expressions (#15997) @wence-
- Add tests covering magic methods of Expr objects (#15996) @wence-
- Add full coverage of utility functions (#15995) @wence-
- Test behaviour of containers (#15994) @wence-
- Fix implementation of any, all, and is_between (#15993) @wence-
- Raise early on unhandled PythonScan node (#15992) @wence-
- Remove mapfunction nodes that don't exist/aren't supported (#15991) @wence-
- Add test coverage for slicing with "out of bounds" negative indices (#15990) @wence-
- Standardize and type `Series.dt` methods (#15987) @mroeschke
- Refactor distinct with hashset-based algorithms (#15984) @srinivasyadav18
- resolve dependency-file-generator warning, remove unnecessary rapids-build-backend configuration (#15980) @jameslamb
- Project automation bug fixes (#15971) @jarmak-nv
- Add typing to single_column_frame (#15965) @mroeschke
- Move some misc Frame methods to appropriate locations (#15963) @mroeschke
- Condense pylibcudf data fixtures (#15958) @lithomas1
- Refactor fillna logic to push specifics toward Frame subclasses and Column subclasses (#15957) @mroeschke
- Remove unused parsing utilities (#15955) @vuule
- Remove `Scalar` container type from polars interpreter (#15953) @wence-
- Support arbitrary CUDA versions in UDF code (#15950) @bdice
- Support large strings in cudf::io::text::multibyte_split (#15947) @davidwendt
- Add external issue label and project automation (#15945) @jarmak-nv
- Enable round-tripping of large strings in `cudf` (#15944) @galipremsagar
- Add more complete type annotations in polars interpreter (#15942) @wence-
- Update implementations to build with the latest cuco (#15938) @PointKernel
- Support timezone aware pandas inputs in cudf (#15935) @mroeschke
- Define Column.nan_as_null to return self (#15923) @mroeschke
- Make Frame._dtype an iterator instead of a dict (#15920) @mroeschke
- Port start of datetime.hpp to pylibcudf (#15916) @wence-
- Introduce `NamedColumn` concept in cudf-polars (#15914) @wence-
- Avoid redefining Frame._get_columns_by_label in subclasses (#15912) @mroeschke
- Templatization of fixed-width parquet decoding kernels. (#15911) @nvdbaranec
- New Decimal <--> Floating conversion (#15905) @pmattione-nvidia
- Use Arrow C Data Interface functions for Python interop (#15904) @vyasr
- Use offsetalator in cudf::io::json::detail::parse_string (#15900) @davidwendt
- Rename strings multiple target replace API (#15898) @davidwendt
- Apply clang-tidy autofixes (#15894) @vyasr
- Update Python labels and remove unnecessary ones (#15893) @vyasr
- Clean up pylibcudf test assertions (#15892) @lithomas1
- Use offsetalator in orc rowgroup_char_counts_kernel (#15891) @davidwendt
- Ensure literals have correct dtype (#15890) @wence-
- Add overflow check when converting large strings to lists columns (#15887) @davidwendt
- Use offsetalator in nvtext::tokenize_with_vocabulary (#15878) @davidwendt
- Update interleave lists column for large strings (#15877) @davidwendt
- Simple NumPy 2 fixes that are clearly no behavior change (#15876) @seberg
- Support `arrow:schema` in Parquet writer to faithfully roundtrip `duration` types with Arrow (#15875) @mhaseeb123
- Refactor join benchmarks to target public APIs with the default stream (#15873) @PointKernel
- Fix url-decode benchmark to use offsetalator (#15871) @davidwendt
- Use offsetalator in strings shift functor (#15870) @davidwendt
- Memory Profiling (#15866) @madsbk
- Expose stream parameter to public rolling APIs (#15865) @srinivasyadav18
- Make Frame.astype return Self instead of a ColumnAccessor (#15861) @mroeschke
- Use ColumnAccessor row and column length attributes more consistently (#15857) @mroeschke
- add unit test setup for cudf_kafka (#15853) @jameslamb
- Remove internal usage of core.index.as_index in favor of cudf.Index (#15851) @mroeschke
- Ensure cudf.Series(cudf.Series(...)) creates a reference to the same index (#15845) @mroeschke
- Remove benchmark-specific use of pinned-pooled memory in Parquet multithreaded benchmark. (#15838) @nvdbaranec
- Implement `on_bad_lines` in json reader (#15834) @galipremsagar
- Make Column.to_pandas return Index instead of Series (#15833) @mroeschke
- Add test of interoperability of cuDF and arrow BYTE_STREAM_SPLIT encoders (#15832) @etseidl
- Refactor Parquet writer options and builders (#15831) @etseidl
- Migrate reshape.pxd to pylibcudf (#15827) @lithomas1
- Remove legacy JSON reader and concurrent_unordered_map.cuh. (#15813) @bdice
- Switch cuIO benchmarks to use pinned-pool host allocations by default. (#15805) @nvdbaranec
- Change thrust::count_if call to raw kernel in strings split APIs (#15762) @davidwendt
- Improve performance for long strings for nvtext::replace_tokens (#15756) @davidwendt
- Implement chunked parquet reader in cudf-python (#15728) @galipremsagar
- Add `from_arrow_host` functions for cudf interop with nanoarrow (#15645) @zeroshade
- Add ability to enable rmm pool on `cudf.pandas` import (#15628) @galipremsagar
- Executor for polars logical plans (#15504) @wence-
- Implement dayname and monthname to match pandas (#15479) @btepera
- Utilities for decimal <--> floating conversion (#15359) @pmattione-nvidia
- For powers of 10, replace ipow with switch (#15353) @pmattione-nvidia
- Use rapids-build-backend. (#15245) @vyasr
- Add `codecov` coverage for `pandas_tests` (#14513) @galipremsagar
- C++
Published by raydouglass over 1 year ago
https://github.com/rapidsai/cudf - v24.06.01
🚨 Breaking Changes
- Deprecate `Groupby.collect` (#15808) @galipremsagar
- Raise FileNotFoundError when a literal JSON string that looks like a json filename is passed (#15806) @lithomas1
- Support filtered I/O in `chunked_parquet_reader` and simplify the use of `parquet_reader_options` (#15764) @mhaseeb123
- Raise errors for unsupported operations on certain types (#15712) @galipremsagar
- Support `DurationType` in cudf parquet reader via `arrow:schema` (#15617) @mhaseeb123
- Remove protobuf and use parsed ORC statistics from libcudf (#15564) @bdice
- Remove legacy JSON reader from Python (#15538) @bdice
- Removing all batching code from parquet writer (#15528) @mhaseeb123
- Convert libcudf resource parameters to rmm::deviceasyncresource_ref (#15507) @harrism
- Remove deprecated strings offsets_begin (#15454) @davidwendt
- Floating <--> fixed-point conversion must now be called explicitly (#15438) @pmattione-nvidia
- Bind
read_parquet_metadataAPI to libcudf instead of pyarrow and extractRowGroupinformation (#15398) @mhaseeb123 - Remove deprecated hash() and sparkmurmurhash3x86_32() (#15375) @davidwendt
- Remove empty elements from exploded character-ngrams output (#15371) @davidwendt
- [FEA] Performance improvement for mixed left semi/anti join (#15288) @tgujar
- Align date_range defaults with pandas, support tz (#15139) @mroeschke
🐛 Bug Fixes
- Backport: Use size_t to allow large conditional joins (#16127) (#16133) @bdice
- Backport #16045 to 24.06 (#16102) @vyasr
- Backport #16038 to 24.06 (#16101) @vyasr
- Backport: Fix segfault in conditional join (#16094) (#16100) @bdice
- Add patch for incorrect cuco noexcept clauses (#16077) @vyasr
- Revert "Fix docs for IO readers and strings_convert" (#15872) @vyasr
- Remove problematic call of index setter to unblock dask-cuda CI (#15844) @charlesbluca
- Use rapids_cpm_nvtx3 to get same nvtx3 target state as rmm (#15840) @robertmaynard
- Return boolean from config_host_memory_resource instead of throwing (#15815) @abellina
- Add temporary dask-cudf workaround for categorical sorting (#15801) @rjzamora
- Fix row group alignment in ORC writer (#15789) @vuule
- Raise error when sorting by categorical column in dask-cudf (#15788) @rjzamora
- Upgrade `arrow` to 16.1 (#15787) @galipremsagar
- Add support for `PandasArray` for `pandas<2.1.0` (#15786) @galipremsagar
- Limit runtime dependency to `libarrow>=16.0.0,<16.1.0a0` (#15782) @pentschev
- Fix cat.as_ordered not propagating correct size (#15780) @mroeschke
- Handle mixed-like homogeneous types in `isin` (#15771) @galipremsagar
- Fix id_vars and value_vars not accepting string scalars in melt (#15765) @mroeschke
- Fix `DatetimeIndex.loc` for all types of ordering cases (#15761) @galipremsagar
- Fix arrow versioning logic (#15755) @vyasr
- Avoid running sanitizer on Java test designed to cause an error (#15753) @jlowe
- Handle empty dataframe object with index present in setitem of `loc` (#15752) @galipremsagar
- Eliminate circular reference in DataFrame/Series.iloc/loc (#15749) @mroeschke
- Cap the absolute row index per pass in parquet chunked reader. (#15735) @nvdbaranec
- Fix `Index.repeat` for `datetime64` types (#15722) @galipremsagar
- Fix multibyte check for case convert for large strings (#15721) @davidwendt
- Fix `get_loc` to properly fetch results from an index that is in decreasing order (#15719) @galipremsagar
- Return same type as the original index for `.loc` operations (#15717) @galipremsagar
- Correct static builds + static arrow (#15715) @robertmaynard
- Raise errors for unsupported operations on certain types (#15712) @galipremsagar
- Fix ColumnAccessor caching of nrows if empty previously (#15710) @mroeschke
- Allow `None` when `nan_as_null=False` in column constructor (#15709) @galipremsagar
- Refine `CudaTest.testCudaException` in case throwing wrong type of CudaError under aarch64 (#15706) @sperlingxx
- Fix maxima of categorical column (#15701) @rjzamora
- Add proxy for inplace operations in `cudf.pandas` (#15695) @galipremsagar
- Make `nan_as_null` behavior consistent across all APIs (#15692) @galipremsagar
- Fix CI s3 api command to fetch latest results (#15687) @galipremsagar
- Add `NumpyExtensionArray` proxy type in `cudf.pandas` (#15686) @galipremsagar
- Properly implement binaryops for proxy types (#15684) @galipremsagar
- Fix copy assignment and the comparison operator of `rmm_host_allocator` (#15677) @vuule
- Fix multi-source reading in JSON byte range reader (#15671) @shrshi
- Return `int64` when pandas compatible mode is turned on for `get_indexer` (#15659) @galipremsagar
- Fix Index contains for error validations and float vs int comparisons (#15657) @galipremsagar
- Preserve sub-second data for time scalars in column construction (#15655) @galipremsagar
- Check row limit size in cudf::strings::join_strings (#15643) @davidwendt
- Enable sorting on column with nulls using query-planning (#15639) @rjzamora
- Fix operator precedence problem in Parquet reader (#15638) @etseidl
- Fix decoding of dictionary encoded FIXED_LEN_BYTE_ARRAY data in Parquet reader (#15601) @etseidl
- Fix debug warnings/errors in from_arrow_device_test.cpp (#15596) @davidwendt
- Add "collect" aggregation support to dask-cudf (#15593) @rjzamora
- Fix categorical-accessor support and testing in dask-cudf (#15591) @rjzamora
- Disable compute-sanitizer usage in CI tests with CUDA<11.6 (#15584) @davidwendt
- Preserve RangeIndex.step in to_arrow/from_arrow (#15581) @mroeschke
- Ignore new cupy warning (#15574) @vyasr
- Add cuda-sanitizer-api dependency for test-cpp matrix 11.4 (#15573) @davidwendt
- Allow apply udf to reference global modules in cudf.pandas (#15569) @mroeschke
- Fix deprecation warnings for json legacy reader (#15563) @davidwendt
- Fix millisecond resampling in cudf Python (#15560) @mroeschke
- Rename JSON_READER_OPTION to JSON_READER_OPTION_NVBENCH. (#15553) @bdice
- Fix a JNI bug in JSON parsing fixup (#15550) @revans2
- Remove conda channel setup from wheel CI image script. (#15539) @bdice
- cudf.pandas: Series dt accessor is CombinedDatetimelikeProperties (#15523) @wence-
- Fix for some compiler warnings in parquet/page_decode.cuh (#15518) @etseidl
- Fix exponent overflow in strings-to-double conversion (#15517) @davidwendt
- nanoarrow uses package override for proper pinned versions generation (#15515) @robertmaynard
- Remove index name overrides in dask-cudf pyarrow table dispatch (#15514) @charlesbluca
- Fix async synchronization issues in json_column.cu (#15497) @karthikeyann
- Add new patch to hide more CCCL APIs (#15493) @vyasr
- Make improvements in pandas-test reporting (#15485) @galipremsagar
- Fixed page data truncation in parquet writer under certain conditions. (#15474) @nvdbaranec
- Only use data_type constructor with scale for decimal types (#15472) @wence-
- Avoid "p2p" shuffle as a default when `dask_cudf` is imported (#15469) @rjzamora
- Fix debug build errors from to_arrow_device_test.cpp (#15463) @davidwendt
- Fix base_normalator::integer_sizeof_fn integer dispatch (#15457) @davidwendt
- Allow consumers of static builds to find nanoarrow (#15456) @robertmaynard
- Allow jit compilation when using a splayed CUDA toolkit (#15451) @robertmaynard
- Handle case of scan aggregation in groupby-transform (#15450) @wence-
- Test static builds in CI and fix nanoarrow configure (#15437) @vyasr
- Fixes potential race in JSON parser when parsing JSON lines format and when recovering from invalid lines (#15419) @elstehle
- Fix errors in chunked ORC writer when no tables were (successfully) written (#15393) @vuule
- Support implicit array conversion with query-planning enabled (#15378) @rjzamora
- Fix arrow-based round trip of empty dataframes (#15373) @wence-
- Remove empty elements from exploded character-ngrams output (#15371) @davidwendt
- Remove boundscheck=False setting in cython files (#15362) @wence-
- Patch dask-expr `var` logic in dask-cudf (#15347) @rjzamora
- Fix for logical and syntactical errors in libcudf c++ examples (#15346) @mhaseeb123
- Disable dask-expr in docs builds. (#15343) @bdice
- Apply the cuFile error work around to data_sink as well (#15335) @vuule
- Fix parquet predicate filtering with column projection (#15113) @karthikeyann
- Check column type equality, handling nested types correctly. (#14531) @bdice
📖 Documentation
- Fix docs for IO readers and strings_convert (#15842) @bdice
- Update cudf.pandas docs for GA (#15744) @beckernick
- Add contributing warning about circular imports (#15691) @er-eis
- Update libcudf developer guide for strings offsets column (#15661) @davidwendt
- Update developer guide with device_async_resource_ref guidelines (#15562) @harrism
- DOC: add pandas intersphinx mapping (#15531) @raybellwaves
- rm-dup-doc in frame.py (#15530) @raybellwaves
- Update CONTRIBUTING.md to use latest cuda env (#15467) @raybellwaves
- Doc: interleave columns pandas compat (#15383) @raybellwaves
- Simplified README Examples (#15338) @wkaisertexas
- Add debug tips section to libcudf developer guide (#15329) @davidwendt
- Fix and clarify notes on result ordering (#13255) @shwina
🚀 New Features
- Add JNI bindings for zstd compression of NVCOMP. (#15729) @firestarman
- Fix spaces around CSV quoted strings (#15727) @thabetx
- Add default pinned pool that falls back to new pinned allocations (#15665) @vuule
- Overhaul ops-codeowners coverage (#15660) @raydouglass
- Concatenate dictionary of objects along axis=1 (#15623) @er-eis
- Construct `pylibcudf` columns from objects supporting `__cuda_array_interface__` (#15615) @brandon-b-miller
- Expose some Parquet per-column configuration options via the python API (#15613) @etseidl
- Migrate string `find` operations to `pylibcudf` (#15604) @brandon-b-miller
- Round trip FIXED_LEN_BYTE_ARRAY data properly in Parquet writer (#15600) @etseidl
- Reading multi-line JSON in string columns using runtime configurable delimiter (#15556) @shrshi
- Remove public gtest dependency from libcudf conda package (#15534) @robertmaynard
- Fea/move to latest nanoarrow (#15526) @robertmaynard
- Migrate string `case` operations to `pylibcudf` (#15489) @brandon-b-miller
- Add Parquet encoding statistics to column chunk metadata (#15452) @etseidl
- Implement JNI for chunked ORC reader (#15446) @ttnghia
- Add some missing optional fields to the Parquet RowGroup metadata (#15421) @etseidl
- Adding parquet transcoding example (#15420) @mhaseeb123
- Add fields to Parquet Statistics structure that were added in parquet-format 2.10 (#15412) @etseidl
- Add option to Parquet writer to skip compressing individual columns (#15411) @etseidl
- Add BYTE_STREAM_SPLIT support to Parquet (#15311) @etseidl
- Introduce benchmark suite for JSON reader options (#15124) @shrshi
- Implement ORC chunked reader (#15094) @ttnghia
- Extend cudf devcontainers to specify jitify2 kernel cache (#15068) @robertmaynard
- Add `to_arrow_device` function to cudf interop using nanoarrow (#15047) @zeroshade
- Add JSON option to prune columns (#14996) @karthikeyann
🛠️ Improvements
- Deprecate `Groupby.collect` (#15808) @galipremsagar
- Raise FileNotFoundError when a literal JSON string that looks like a json filename is passed (#15806) @lithomas1
- Deprecate `divisions='quantile'` support in `set_index` (#15804) @rjzamora
- Improve performance of Series.to_numpy/to_cupy (#15792) @mroeschke
- Access `self.index` instead of `self._index` where possible (#15781) @mroeschke
- Support filtered I/O in `chunked_parquet_reader` and simplify the use of `parquet_reader_options` (#15764) @mhaseeb123
- Avoid index-to-column conversion in some DataFrame ops (#15763) @mroeschke
- Fix `chunked_parquet_reader` behavior when input has no more rows to read (#15757) @mhaseeb123
- [JNI] Expose java API for cudf::io::config_host_memory_resource (#15745) @abellina
- Migrate all cpp pxd files into pylibcudf (#15740) @vyasr
- Validate and materialize iterators earlier in as_column (#15739) @mroeschke
- Push some as_column arrow logic to ColumnBase.from_arrow (#15738) @mroeschke
- Expose stream parameter in public reduction APIs (#15737) @srinivasyadav18
- remove unnecessary 'setuptools' host dependency, simplify dependencies.yaml (#15736) @jameslamb
- Defer to C++ equality and hashing for pylibcudf DataType and Aggregation objects (#15732) @wence-
- Implement null-aware NOT_EQUALS binop (#15731) @wence-
- Fix split-record result list column offset type (#15707) @davidwendt
- Upgrade `arrow` to `16` (#15703) @galipremsagar
- Remove experimental namespace from make_strings_children (#15702) @davidwendt
- Rework get_json_object benchmark to use nvbench (#15698) @davidwendt
- Rework some python tests of Parquet delta encodings (#15693) @etseidl
- Skeleton cudf polars package (#15688) @wence-
- Upgrade pre commit hooks (#15685) @wence-
- Allow `fillna` to validate for `CategoricalColumn.fillna` (#15683) @galipremsagar
- Misc Column cleanups (#15682) @mroeschke
- Reducing runtime of JSON reader options benchmark (#15681) @shrshi
- Add `Timestamp` and `Timedelta` proxy types (#15680) @galipremsagar
- Remove host_parse_nested_json. (#15674) @bdice
- Reduce runtime for ParquetChunkedReaderInputLimitTest gtests (#15672) @davidwendt
- Add large-strings gtest for cudf::interleave_columns (#15669) @davidwendt
- Use experimental make_strings_children for multi-replace_re (#15667) @davidwendt
- Enabled `Holiday` types in `cudf.pandas` (#15664) @galipremsagar
- Remove obsolete `XFAIL` markers for query-planning (#15662) @rjzamora
- Clean up join benchmarks (#15644) @PointKernel
- Enable warnings as errors in custreamz (#15642) @mroeschke
- Improve distinct join with set `retrieve` (#15636) @PointKernel
- Fix -Werror=type-limits. (#15635) @bdice
- Enable FutureWarnings/DeprecationWarnings as errors for dask_cudf (#15634) @mroeschke
- Remove NVBench SHA override. (#15633) @alliepiper
- Add support for large string columns to Parquet reader and writer (#15632) @etseidl
- Large strings support in MD5 and SHA hashers (#15631) @davidwendt
- Fix make_offsets_child_column usage in cudf::strings::detail::shift (#15630) @davidwendt
- Use experimental make_strings_children for strings convert (#15629) @davidwendt
- Forward-merge branch-24.04 to branch-24.06 (#15627) @bdice
- Avoid accessing attributes via `_column` if not needed (#15624) @mroeschke
- Make ColumnBase.__cuda_array_interface__ opt out instead of opt in (#15622) @mroeschke
- Large strings support for cudf::gather (#15621) @davidwendt
- Remove jni-docker-build workflow (#15619) @bdice
- Support `DurationType` in cudf parquet reader via `arrow:schema` (#15617) @mhaseeb123
- Drop Centos7 support (#15608) @NvTimLiu
- Use experimental make_strings_children for json/csv writers (#15599) @davidwendt
- Use experimental make_strings_children for strings join/url_encode/slice (#15598) @davidwendt
- Use experimental make_strings_children in nvtext APIs (#15595) @davidwendt
- Migrate to `{{ stdlib("c") }}` (#15594) @hcho3
- Deprecate `to/from_dask_dataframe` APIs in dask-cudf (#15592) @rjzamora
- Minor fixups for future NumPy 2 compatibility (#15590) @seberg
- Delay materializing RangeIndex in .reset_index (#15588) @mroeschke
- Use experimental make_strings_children for capitalize/case/pad functions (#15587) @davidwendt
- Use experimental make_strings_children for strings replace/filter/translate (#15586) @davidwendt
- Add multithreaded parquet reader benchmarks. (#15585) @nvdbaranec
- Don't materialize column during RangeIndex methods (#15582) @mroeschke
- Improve performance for cudf::strings::count_re (#15578) @davidwendt
- Replace RangeIndex._start/_stop/_step with _range (#15576) @mroeschke
- add --rm and --name to devcontainer run args (#15572) @trxcllnt
- Change the default dictionary policy in Parquet writer from `ALWAYS` to `ADAPTIVE` (#15570) @mhaseeb123
- Rename experimental JSON tests. (#15568) @bdice
- Refactor JNI native dependency loading to allow returning of library path (#15566) @jlowe
- Remove protobuf and use parsed ORC statistics from libcudf (#15564) @bdice
- Deprecate legacy JSON reader options. (#15558) @bdice
- Use same .clang-format in cuDF JNI (#15557) @bdice
- Large strings support for cudf::fill (#15555) @davidwendt
- Upgrade upper bound pinning to `pandas-2.2.2` (#15554) @galipremsagar
- Work around issues with cccl main (#15552) @miscco
- Enable pandas plotting unit tests for cudf.pandas (#15547) @mroeschke
- Move timezone conversion logic to `DatetimeColumn` (#15545) @mroeschke
- Large strings support for cudf::interleave_columns (#15544) @davidwendt
- [skip ci] Switch back to 24.06 branch for pandas tests (#15543) @galipremsagar
- Remove checks dependency from static-configure test job. (#15542) @bdice
- Remove legacy JSON reader from Python (#15538) @bdice
- Enable more ignored pandas unit tests for cudf.pandas (#15535) @mroeschke
- Large strings support for cudf::clamp (#15533) @davidwendt
- Remove version hard-coding (#15529) @galipremsagar
- Removing all batching code from parquet writer (#15528) @mhaseeb123
- Make some private class properties not settable (#15527) @mroeschke
- Large strings support in regex replace APIs (#15524) @davidwendt
- Skip pandas unit tests that crash pytest workers in `cudf.pandas` (#15521) @mroeschke
- Preserve column metadata during more DataFrame operations (#15519) @mroeschke
- Move to pandas-tests to a dedicated workflow file and trigger it from branch.yaml (#15516) @galipremsagar
- Large strings gtest fixture and utilities (#15513) @davidwendt
- Convert libcudf resource parameters to rmm::device_async_resource_ref (#15507) @harrism
- Relax protobuf lower bound to 3.20. (#15506) @bdice
- Clean up index methods (#15496) @mroeschke
- Update strings contains benchmarks to nvbench (#15495) @davidwendt
- Update NVBench fixture to use new hooks, fix pinned memory segfault. (#15492) @alliepiper
- Enable tests/scalar and test/series in cudf.pandas tests (#15486) @mroeschke
- Clean up __cuda_array_interface__ handling in as_column (#15477) @mroeschke
- Avoid .ordered and .categories from being settable in CategoricalColumn and CategoricalDtype (#15475) @mroeschke
- Ignore pandas tests for cudf.pandas that need motoserver (#15468) @mroeschke
- Use cached_property for NumericColumn.nan_count instead of ._nan_count variable (#15466) @mroeschke
- Add to_arrow_device() functions that accept views (#15465) @davidwendt
- Add custom status check workflow (#15464) @galipremsagar
- Disable pandas 2.x clipboard tests in cudf.pandas tests (#15462) @mroeschke
- Enable tests/strings/test_api.py and tests/io/pytables in cudf.pandas tests (#15461) @mroeschke
- Enable test_parsing in cudf.pandas tests (#15460) @mroeschke
- Add `from_arrow_device` function to cudf interop using nanoarrow (#15458) @zeroshade
- Remove deprecated strings offsets_begin (#15454) @davidwendt
- Enable tests/windows/ in cudf.pandas tests (#15444) @mroeschke
- Enable tests/interchange/test_impl.py in cudf.pandas tests (#15443) @mroeschke
- Enable tests/io/test_user_agent.py in cudf pandas tests (#15442) @mroeschke
- Performance improvement in libcudf case conversion for long strings (#15441) @davidwendt
- Remove prior test skipping in run-pandas-tests with testing 2.2.1 (#15440) @mroeschke
- Support orc and text IO with dask-expr using legacy conversion (#15439) @rjzamora
- Floating <--> fixed-point conversion must now be called explicitly (#15438) @pmattione-nvidia
- Unify Copy-On-Write and Spilling (#15436) @madsbk
- Enable `dask_cudf` json and s3 tests with query-planning on (#15408) @rjzamora
- Bump ruff and codespell pre-commit checks (#15407) @mroeschke
- Enable all tests for `arm` arch (#15402) @galipremsagar
- Bind `read_parquet_metadata` API to libcudf instead of pyarrow and extract `RowGroup` information (#15398) @mhaseeb123
- Optimizing multi-source byte range reading in JSON reader (#15396) @shrshi
- add correct labels to pandas_function_request.md (#15381) @raybellwaves
- Remove deprecated hash() and spark_murmurhash3_x86_32() (#15375) @davidwendt
- Large strings support in cudf::merge (#15374) @davidwendt
- Enable test-reporting for pandas pytests in CI (#15369) @galipremsagar
- Use logical types in Parquet reader (#15365) @etseidl
- Add experimental make_strings_children utility (#15363) @davidwendt
- Forward-merge branch-24.04 to branch-24.06 (#15349) @bdice
- Fix CMake files in libcudf C++ examples to use existing libcudf build if present (#15348) @mhaseeb123
- Use ruff pydocstyle over pydocstyle pre-commit hook (#15345) @mroeschke
- Refactor stream mode setup for gtests (#15337) @davidwendt
- Benchmark decimal <--> floating conversions. (#15334) @pmattione-nvidia
- Avoid duplicate dask-cudf testing (#15333) @rjzamora
- Skip decode steps in Parquet reader when nullable columns have no nulls (#15332) @etseidl
- Update udf_cpp to use rapids_cpm_cccl. (#15331) @bdice
- Forward-merge branch-24.04 into branch-24.06 skip ci @rapids-bot[bot]
- Allow `numeric_only=True` for simple groupby reductions (#15326) @rjzamora
- Drop CentOS 7 support. (#15323) @bdice
- Rework cudf::find_and_replace_all to use gather-based make_strings_column (#15305) @davidwendt
- First pass at adding testing for pylibcudf (#15300) @vyasr
- [FEA] Performance improvement for mixed left semi/anti join (#15288) @tgujar
- Rework cudf::replace_nulls to use strings::detail::copy_if_else (#15286) @davidwendt
- Clean up special casing in `as_column` for non-typed input (#15276) @mroeschke
- Large strings support in cudf::concatenate (#15195) @davidwendt
- Use less _is_categorical_dtype (#15148) @mroeschke
- Align date_range defaults with pandas, support tz (#15139) @mroeschke
- `ModuleAccelerator` performance: cache the result of checking if a caller is in the denylist (#15056) @shwina
- Use offsetalator in cudf::strings::replace functions (#14824) @davidwendt
- Cleanup some timedelta/datetime column logic (#14715) @mroeschke
- Refactor numpy array input in as_column (#14651) @mroeschke
- Refactor joins for conditional semis and antis (#14646) @DanialJavady96
- Eagerly populate the class dict for cudf.pandas proxy types (#14534) @shwina
- Some additional kernel thread index refactoring. (#14107) @bdice
Published by raydouglass over 1 year ago
https://github.com/rapidsai/cudf - v24.06.00
🚨 Breaking Changes
- Deprecate `Groupby.collect` (#15808) @galipremsagar
- Raise FileNotFoundError when a literal JSON string that looks like a json filename is passed (#15806) @lithomas1
- Support filtered I/O in `chunked_parquet_reader` and simplify the use of `parquet_reader_options` (#15764) @mhaseeb123
- Raise errors for unsupported operations on certain types (#15712) @galipremsagar
- Support `DurationType` in cudf parquet reader via `arrow:schema` (#15617) @mhaseeb123
- Remove protobuf and use parsed ORC statistics from libcudf (#15564) @bdice
- Remove legacy JSON reader from Python (#15538) @bdice
- Removing all batching code from parquet writer (#15528) @mhaseeb123
- Convert libcudf resource parameters to rmm::device_async_resource_ref (#15507) @harrism
- Remove deprecated strings offsets_begin (#15454) @davidwendt
- Floating <--> fixed-point conversion must now be called explicitly (#15438) @pmattione-nvidia
- Bind `read_parquet_metadata` API to libcudf instead of pyarrow and extract `RowGroup` information (#15398) @mhaseeb123
- Remove deprecated hash() and spark_murmurhash3_x86_32() (#15375) @davidwendt
- Remove empty elements from exploded character-ngrams output (#15371) @davidwendt
- [FEA] Performance improvement for mixed left semi/anti join (#15288) @tgujar
- Align date_range defaults with pandas, support tz (#15139) @mroeschke
🐛 Bug Fixes
- Revert "Fix docs for IO readers and strings_convert" (#15872) @vyasr
- Remove problematic call of index setter to unblock dask-cuda CI (#15844) @charlesbluca
- Use rapids_cpm_nvtx3 to get same nvtx3 target state as rmm (#15840) @robertmaynard
- Return boolean from config_host_memory_resource instead of throwing (#15815) @abellina
- Add temporary dask-cudf workaround for categorical sorting (#15801) @rjzamora
- Fix row group alignment in ORC writer (#15789) @vuule
- Raise error when sorting by categorical column in dask-cudf (#15788) @rjzamora
- Upgrade `arrow` to 16.1 (#15787) @galipremsagar
- Add support for `PandasArray` for `pandas<2.1.0` (#15786) @galipremsagar
- Limit runtime dependency to `libarrow>=16.0.0,<16.1.0a0` (#15782) @pentschev
- Fix cat.as_ordered not propagating correct size (#15780) @mroeschke
- Handle mixed-like homogeneous types in `isin` (#15771) @galipremsagar
- Fix id_vars and value_vars not accepting string scalars in melt (#15765) @mroeschke
- Fix `DatetimeIndex.loc` for all types of ordering cases (#15761) @galipremsagar
- Fix arrow versioning logic (#15755) @vyasr
- Avoid running sanitizer on Java test designed to cause an error (#15753) @jlowe
- Handle empty dataframe object with index present in setitem of `loc` (#15752) @galipremsagar
- Eliminate circular reference in DataFrame/Series.iloc/loc (#15749) @mroeschke
- Cap the absolute row index per pass in parquet chunked reader. (#15735) @nvdbaranec
- Fix `Index.repeat` for `datetime64` types (#15722) @galipremsagar
- Fix multibyte check for case convert for large strings (#15721) @davidwendt
- Fix `get_loc` to properly fetch results from an index that is in decreasing order (#15719) @galipremsagar
- Return same type as the original index for `.loc` operations (#15717) @galipremsagar
- Correct static builds + static arrow (#15715) @robertmaynard
- Raise errors for unsupported operations on certain types (#15712) @galipremsagar
- Fix ColumnAccessor caching of nrows if empty previously (#15710) @mroeschke
- Allow `None` when `nan_as_null=False` in column constructor (#15709) @galipremsagar
- Refine `CudaTest.testCudaException` in case throwing wrong type of CudaError under aarch64 (#15706) @sperlingxx
- Fix maxima of categorical column (#15701) @rjzamora
- Add proxy for inplace operations in `cudf.pandas` (#15695) @galipremsagar
- Make `nan_as_null` behavior consistent across all APIs (#15692) @galipremsagar
- Fix CI s3 api command to fetch latest results (#15687) @galipremsagar
- Add `NumpyExtensionArray` proxy type in `cudf.pandas` (#15686) @galipremsagar
- Properly implement binaryops for proxy types (#15684) @galipremsagar
- Fix copy assignment and the comparison operator of `rmm_host_allocator` (#15677) @vuule
- Fix multi-source reading in JSON byte range reader (#15671) @shrshi
- Return `int64` when pandas compatible mode is turned on for `get_indexer` (#15659) @galipremsagar
- Fix Index contains for error validations and float vs int comparisons (#15657) @galipremsagar
- Preserve sub-second data for time scalars in column construction (#15655) @galipremsagar
- Check row limit size in cudf::strings::join_strings (#15643) @davidwendt
- Enable sorting on column with nulls using query-planning (#15639) @rjzamora
- Fix operator precedence problem in Parquet reader (#15638) @etseidl
- Fix decoding of dictionary encoded FIXED_LEN_BYTE_ARRAY data in Parquet reader (#15601) @etseidl
- Fix debug warnings/errors in from_arrow_device_test.cpp (#15596) @davidwendt
- Add "collect" aggregation support to dask-cudf (#15593) @rjzamora
- Fix categorical-accessor support and testing in dask-cudf (#15591) @rjzamora
- Disable compute-sanitizer usage in CI tests with CUDA<11.6 (#15584) @davidwendt
- Preserve RangeIndex.step in to_arrow/from_arrow (#15581) @mroeschke
- Ignore new cupy warning (#15574) @vyasr
- Add cuda-sanitizer-api dependency for test-cpp matrix 11.4 (#15573) @davidwendt
- Allow apply udf to reference global modules in cudf.pandas (#15569) @mroeschke
- Fix deprecation warnings for json legacy reader (#15563) @davidwendt
- Fix millisecond resampling in cudf Python (#15560) @mroeschke
- Rename JSON_READER_OPTION to JSON_READER_OPTION_NVBENCH. (#15553) @bdice
- Fix a JNI bug in JSON parsing fixup (#15550) @revans2
- Remove conda channel setup from wheel CI image script. (#15539) @bdice
- cudf.pandas: Series dt accessor is CombinedDatetimelikeProperties (#15523) @wence-
- Fix for some compiler warnings in parquet/page_decode.cuh (#15518) @etseidl
- Fix exponent overflow in strings-to-double conversion (#15517) @davidwendt
- nanoarrow uses package override for proper pinned versions generation (#15515) @robertmaynard
- Remove index name overrides in dask-cudf pyarrow table dispatch (#15514) @charlesbluca
- Fix async synchronization issues in json_column.cu (#15497) @karthikeyann
- Add new patch to hide more CCCL APIs (#15493) @vyasr
- Make improvements in pandas-test reporting (#15485) @galipremsagar
- Fixed page data truncation in parquet writer under certain conditions. (#15474) @nvdbaranec
- Only use data_type constructor with scale for decimal types (#15472) @wence-
- Avoid "p2p" shuffle as a default when `dask_cudf` is imported (#15469) @rjzamora
- Fix debug build errors from to_arrow_device_test.cpp (#15463) @davidwendt
- Fix base_normalator::integer_sizeof_fn integer dispatch (#15457) @davidwendt
- Allow consumers of static builds to find nanoarrow (#15456) @robertmaynard
- Allow jit compilation when using a splayed CUDA toolkit (#15451) @robertmaynard
- Handle case of scan aggregation in groupby-transform (#15450) @wence-
- Test static builds in CI and fix nanoarrow configure (#15437) @vyasr
- Fixes potential race in JSON parser when parsing JSON lines format and when recovering from invalid lines (#15419) @elstehle
- Fix errors in chunked ORC writer when no tables were (successfully) written (#15393) @vuule
- Support implicit array conversion with query-planning enabled (#15378) @rjzamora
- Fix arrow-based round trip of empty dataframes (#15373) @wence-
- Remove empty elements from exploded character-ngrams output (#15371) @davidwendt
- Remove boundscheck=False setting in cython files (#15362) @wence-
- Patch dask-expr `var` logic in dask-cudf (#15347) @rjzamora
- Fix for logical and syntactical errors in libcudf c++ examples (#15346) @mhaseeb123
- Disable dask-expr in docs builds. (#15343) @bdice
- Apply the cuFile error work around to data_sink as well (#15335) @vuule
- Fix parquet predicate filtering with column projection (#15113) @karthikeyann
- Check column type equality, handling nested types correctly. (#14531) @bdice
📖 Documentation
- Fix docs for IO readers and strings_convert (#15842) @bdice
- Update cudf.pandas docs for GA (#15744) @beckernick
- Add contributing warning about circular imports (#15691) @er-eis
- Update libcudf developer guide for strings offsets column (#15661) @davidwendt
- Update developer guide with device_async_resource_ref guidelines (#15562) @harrism
- DOC: add pandas intersphinx mapping (#15531) @raybellwaves
- rm-dup-doc in frame.py (#15530) @raybellwaves
- Update CONTRIBUTING.md to use latest cuda env (#15467) @raybellwaves
- Doc: interleave columns pandas compat (#15383) @raybellwaves
- Simplified README Examples (#15338) @wkaisertexas
- Add debug tips section to libcudf developer guide (#15329) @davidwendt
- Fix and clarify notes on result ordering (#13255) @shwina
🚀 New Features
- Add JNI bindings for zstd compression of NVCOMP. (#15729) @firestarman
- Fix spaces around CSV quoted strings (#15727) @thabetx
- Add default pinned pool that falls back to new pinned allocations (#15665) @vuule
- Overhaul ops-codeowners coverage (#15660) @raydouglass
- Concatenate dictionary of objects along axis=1 (#15623) @er-eis
- Construct `pylibcudf` columns from objects supporting `__cuda_array_interface__` (#15615) @brandon-b-miller
- Expose some Parquet per-column configuration options via the python API (#15613) @etseidl
- Migrate string `find` operations to `pylibcudf` (#15604) @brandon-b-miller
- Round trip FIXED_LEN_BYTE_ARRAY data properly in Parquet writer (#15600) @etseidl
- Reading multi-line JSON in string columns using runtime configurable delimiter (#15556) @shrshi
- Remove public gtest dependency from libcudf conda package (#15534) @robertmaynard
- Fea/move to latest nanoarrow (#15526) @robertmaynard
- Migrate string `case` operations to `pylibcudf` (#15489) @brandon-b-miller
- Add Parquet encoding statistics to column chunk metadata (#15452) @etseidl
- Implement JNI for chunked ORC reader (#15446) @ttnghia
- Add some missing optional fields to the Parquet RowGroup metadata (#15421) @etseidl
- Adding parquet transcoding example (#15420) @mhaseeb123
- Add fields to Parquet Statistics structure that were added in parquet-format 2.10 (#15412) @etseidl
- Add option to Parquet writer to skip compressing individual columns (#15411) @etseidl
- Add BYTE_STREAM_SPLIT support to Parquet (#15311) @etseidl
- Introduce benchmark suite for JSON reader options (#15124) @shrshi
- Implement ORC chunked reader (#15094) @ttnghia
- Extend cudf devcontainers to specify jitify2 kernel cache (#15068) @robertmaynard
- Add `to_arrow_device` function to cudf interop using nanoarrow (#15047) @zeroshade
- Add JSON option to prune columns (#14996) @karthikeyann
🛠️ Improvements
- Deprecate `Groupby.collect` (#15808) @galipremsagar
- Raise FileNotFoundError when a literal JSON string that looks like a json filename is passed (#15806) @lithomas1
- Deprecate `divisions='quantile'` support in `set_index` (#15804) @rjzamora
- Improve performance of Series.to_numpy/to_cupy (#15792) @mroeschke
- Access `self.index` instead of `self._index` where possible (#15781) @mroeschke
- Support filtered I/O in `chunked_parquet_reader` and simplify the use of `parquet_reader_options` (#15764) @mhaseeb123
- Avoid index-to-column conversion in some DataFrame ops (#15763) @mroeschke
- Fix `chunked_parquet_reader` behavior when input has no more rows to read (#15757) @mhaseeb123
- [JNI] Expose java API for cudf::io::config_host_memory_resource (#15745) @abellina
- Migrate all cpp pxd files into pylibcudf (#15740) @vyasr
- Validate and materialize iterators earlier in as_column (#15739) @mroeschke
- Push some as_column arrow logic to ColumnBase.from_arrow (#15738) @mroeschke
- Expose stream parameter in public reduction APIs (#15737) @srinivasyadav18
- remove unnecessary 'setuptools' host dependency, simplify dependencies.yaml (#15736) @jameslamb
- Defer to C++ equality and hashing for pylibcudf DataType and Aggregation objects (#15732) @wence-
- Implement null-aware NOT_EQUALS binop (#15731) @wence-
- Fix split-record result list column offset type (#15707) @davidwendt
- Upgrade `arrow` to `16` (#15703) @galipremsagar
- Remove experimental namespace from make_strings_children (#15702) @davidwendt
- Rework get_json_object benchmark to use nvbench (#15698) @davidwendt
- Rework some python tests of Parquet delta encodings (#15693) @etseidl
- Skeleton cudf polars package (#15688) @wence-
- Upgrade pre commit hooks (#15685) @wence-
- Allow `fillna` to validate for `CategoricalColumn.fillna` (#15683) @galipremsagar
- Misc Column cleanups (#15682) @mroeschke
- Reducing runtime of JSON reader options benchmark (#15681) @shrshi
- Add `Timestamp` and `Timedelta` proxy types (#15680) @galipremsagar
- Remove host_parse_nested_json. (#15674) @bdice
- Reduce runtime for ParquetChunkedReaderInputLimitTest gtests (#15672) @davidwendt
- Add large-strings gtest for cudf::interleave_columns (#15669) @davidwendt
- Use experimental make_strings_children for multi-replace_re (#15667) @davidwendt
- Enabled `Holiday` types in `cudf.pandas` (#15664) @galipremsagar
- Remove obsolete `XFAIL` markers for query-planning (#15662) @rjzamora
- Clean up join benchmarks (#15644) @PointKernel
- Enable warnings as errors in custreamz (#15642) @mroeschke
- Improve distinct join with set `retrieve` (#15636) @PointKernel
- Fix -Werror=type-limits. (#15635) @bdice
- Enable FutureWarnings/DeprecationWarnings as errors for dask_cudf (#15634) @mroeschke
- Remove NVBench SHA override. (#15633) @alliepiper
- Add support for large string columns to Parquet reader and writer (#15632) @etseidl
- Large strings support in MD5 and SHA hashers (#15631) @davidwendt
- Fix make_offsets_child_column usage in cudf::strings::detail::shift (#15630) @davidwendt
- Use experimental make_strings_children for strings convert (#15629) @davidwendt
- Forward-merge branch-24.04 to branch-24.06 (#15627) @bdice
- Avoid accessing attributes via `_column` if not needed (#15624) @mroeschke
- Make ColumnBase.__cuda_array_interface__ opt out instead of opt in (#15622) @mroeschke
- Large strings support for cudf::gather (#15621) @davidwendt
- Remove jni-docker-build workflow (#15619) @bdice
- Support `DurationType` in cudf parquet reader via `arrow:schema` (#15617) @mhaseeb123
- Drop Centos7 support (#15608) @NvTimLiu
- Use experimental make_strings_children for json/csv writers (#15599) @davidwendt
- Use experimental make_strings_children for strings join/url_encode/slice (#15598) @davidwendt
- Use experimental make_strings_children in nvtext APIs (#15595) @davidwendt
- Migrate to `{{ stdlib("c") }}` (#15594) @hcho3
- Deprecate `to/from_dask_dataframe` APIs in dask-cudf (#15592) @rjzamora
- Minor fixups for future NumPy 2 compatibility (#15590) @seberg
- Delay materializing RangeIndex in .reset_index (#15588) @mroeschke
- Use experimental make_strings_children for capitalize/case/pad functions (#15587) @davidwendt
- Use experimental make_strings_children for strings replace/filter/translate (#15586) @davidwendt
- Add multithreaded parquet reader benchmarks. (#15585) @nvdbaranec
- Don't materialize column during RangeIndex methods (#15582) @mroeschke
- Improve performance for cudf::strings::count_re (#15578) @davidwendt
- Replace RangeIndex._start/_stop/_step with _range (#15576) @mroeschke
- add --rm and --name to devcontainer run args (#15572) @trxcllnt
- Change the default dictionary policy in Parquet writer from `ALWAYS` to `ADAPTIVE` (#15570) @mhaseeb123
- Rename experimental JSON tests. (#15568) @bdice
- Refactor JNI native dependency loading to allow returning of library path (#15566) @jlowe
- Remove protobuf and use parsed ORC statistics from libcudf (#15564) @bdice
- Deprecate legacy JSON reader options. (#15558) @bdice
- Use same .clang-format in cuDF JNI (#15557) @bdice
- Large strings support for cudf::fill (#15555) @davidwendt
- Upgrade upper bound pinning to `pandas-2.2.2` (#15554) @galipremsagar
- Work around issues with cccl main (#15552) @miscco
- Enable pandas plotting unit tests for cudf.pandas (#15547) @mroeschke
- Move timezone conversion logic to `DatetimeColumn` (#15545) @mroeschke
- Large strings support for cudf::interleave_columns (#15544) @davidwendt
- [skip ci] Switch back to 24.06 branch for pandas tests (#15543) @galipremsagar
- Remove checks dependency from static-configure test job. (#15542) @bdice
- Remove legacy JSON reader from Python (#15538) @bdice
- Enable more ignored pandas unit tests for cudf.pandas (#15535) @mroeschke
- Large strings support for cudf::clamp (#15533) @davidwendt
- Remove version hard-coding (#15529) @galipremsagar
- Removing all batching code from parquet writer (#15528) @mhaseeb123
- Make some private class properties not settable (#15527) @mroeschke
- Large strings support in regex replace APIs (#15524) @davidwendt
- Skip pandas unit tests that crash pytest workers in `cudf.pandas` (#15521) @mroeschke
- Preserve column metadata during more DataFrame operations (#15519) @mroeschke
- Move pandas-tests to a dedicated workflow file and trigger it from branch.yaml (#15516) @galipremsagar
- Large strings gtest fixture and utilities (#15513) @davidwendt
- Convert libcudf resource parameters to rmm::device_async_resource_ref (#15507) @harrism
- Relax protobuf lower bound to 3.20. (#15506) @bdice
- Clean up index methods (#15496) @mroeschke
- Update strings contains benchmarks to nvbench (#15495) @davidwendt
- Update NVBench fixture to use new hooks, fix pinned memory segfault. (#15492) @alliepiper
- Enable tests/scalar and test/series in cudf.pandas tests (#15486) @mroeschke
- Clean up __cuda_array_interface__ handling in as_column (#15477) @mroeschke
- Avoid .ordered and .categories from being settable in CategoricalColumn and CategoricalDtype (#15475) @mroeschke
- Ignore pandas tests for cudf.pandas that need motoserver (#15468) @mroeschke
- Use cached_property for NumericColumn.nan_count instead of ._nan_count variable (#15466) @mroeschke
- Add to_arrow_device() functions that accept views (#15465) @davidwendt
- Add custom status check workflow (#15464) @galipremsagar
- Disable pandas 2.x clipboard tests in cudf.pandas tests (#15462) @mroeschke
- Enable tests/strings/test_api.py and tests/io/pytables in cudf.pandas tests (#15461) @mroeschke
- Enable test_parsing in cudf.pandas tests (#15460) @mroeschke
- Add `from_arrow_device` function to cudf interop using nanoarrow (#15458) @zeroshade
- Remove deprecated strings offsets_begin (#15454) @davidwendt
- Enable tests/windows/ in cudf.pandas tests (#15444) @mroeschke
- Enable tests/interchange/test_impl.py in cudf.pandas tests (#15443) @mroeschke
- Enable tests/io/test_user_agent.py in cudf pandas tests (#15442) @mroeschke
- Performance improvement in libcudf case conversion for long strings (#15441) @davidwendt
- Remove prior test skipping in run-pandas-tests with testing 2.2.1 (#15440) @mroeschke
- Support orc and text IO with dask-expr using legacy conversion (#15439) @rjzamora
- Floating <--> fixed-point conversion must now be called explicitly (#15438) @pmattione-nvidia
- Unify Copy-On-Write and Spilling (#15436) @madsbk
- Enable `dask_cudf` json and s3 tests with query-planning on (#15408) @rjzamora
- Bump ruff and codespell pre-commit checks (#15407) @mroeschke
- Enable all tests for `arm` arch (#15402) @galipremsagar
- Bind `read_parquet_metadata` API to libcudf instead of pyarrow and extract `RowGroup` information (#15398) @mhaseeb123
- Optimizing multi-source byte range reading in JSON reader (#15396) @shrshi
- add correct labels to pandas_function_request.md (#15381) @raybellwaves
- Remove deprecated hash() and spark_murmurhash3_x86_32() (#15375) @davidwendt
- Large strings support in cudf::merge (#15374) @davidwendt
- Enable test-reporting for pandas pytests in CI (#15369) @galipremsagar
- Use logical types in Parquet reader (#15365) @etseidl
- Add experimental make_strings_children utility (#15363) @davidwendt
- Forward-merge branch-24.04 to branch-24.06 (#15349) @bdice
- Fix CMake files in libcudf C++ examples to use existing libcudf build if present (#15348) @mhaseeb123
- Use ruff pydocstyle over pydocstyle pre-commit hook (#15345) @mroeschke
- Refactor stream mode setup for gtests (#15337) @davidwendt
- Benchmark decimal <--> floating conversions. (#15334) @pmattione-nvidia
- Avoid duplicate dask-cudf testing (#15333) @rjzamora
- Skip decode steps in Parquet reader when nullable columns have no nulls (#15332) @etseidl
- Update udf_cpp to use rapids_cpm_cccl. (#15331) @bdice
- Forward-merge branch-24.04 into branch-24.06 skip ci @rapids-bot[bot]
- Allow `numeric_only=True` for simple groupby reductions (#15326) @rjzamora
- Drop CentOS 7 support. (#15323) @bdice
- Rework cudf::find_and_replace_all to use gather-based make_strings_column (#15305) @davidwendt
- First pass at adding testing for pylibcudf (#15300) @vyasr
- [FEA] Performance improvement for mixed left semi/anti join (#15288) @tgujar
- Rework cudf::replace_nulls to use strings::detail::copy_if_else (#15286) @davidwendt
- Clean up special casing in `as_column` for non-typed input (#15276) @mroeschke
- Large strings support in cudf::concatenate (#15195) @davidwendt
- Use less is_categorical_dtype (#15148) @mroeschke
- Align date_range defaults with pandas, support tz (#15139) @mroeschke
- `ModuleAccelerator` performance: cache the result of checking if a caller is in the denylist (#15056) @shwina
- Use offsetalator in cudf::strings::replace functions (#14824) @davidwendt
- Cleanup some timedelta/datetime column logic (#14715) @mroeschke
- Refactor numpy array input in as_column (#14651) @mroeschke
- Refactor joins for conditional semis and antis (#14646) @DanialJavady96
- Eagerly populate the class dict for cudf.pandas proxy types (#14534) @shwina
- Some additional kernel thread index refactoring. (#14107) @bdice
Published by raydouglass over 1 year ago
https://github.com/rapidsai/cudf - v24.04.01
🚨 Breaking Changes
- Restructure pylibcudf/arrow interop facilities (#15325) @vyasr
- Change exceptions thrown by copying APIs (#15319) @vyasr
- Change strings_column_view::char_size to return int64 (#15197) @davidwendt
- Upgrade to `arrow-14.0.2` (#15108) @galipremsagar
- Add support for `pandas-2.2` in `cudf` (#15100) @galipremsagar
- Deprecate cudf::hashing::spark_murmurhash3_x86_32 (#15074) @davidwendt
- Align MultiIndex.get_indexer with pandas 2.2 change (#15059) @mroeschke
- Raise an error on import for unsupported GPUs. (#15053) @bdice
- Deprecate datelike isin casting strings to dates to match pandas 2.2 (#15046) @mroeschke
- Align concat Series name behavior in pandas 2.2 (#15032) @mroeschke
- Add `future_stack` to `DataFrame.stack` (#15015) @galipremsagar
- Deprecate groupby fillna (#15000) @mroeschke
- Deprecate replace with categorical columns (#14988) @mroeschke
- Deprecate delim_whitespace in read_csv for pandas 2.2 (#14986) @mroeschke
- Deprecate parameters similar to pandas 2.2 (#14984) @mroeschke
- Add missing atomic operators, refactor atomic operators, move atomic operators to detail namespace. (#14962) @bdice
- Add `pandas-2.x` support in `cudf` (#14916) @galipremsagar
- Use cuco::static_set in the hash-based groupby (#14813) @PointKernel
🐛 Bug Fixes
- Fix an issue with creating a series from scalar when `dtype='category'` (#15476) @galipremsagar
- Update pre-commit-hooks to v0.0.3 (#15355) @KyleFromNVIDIA
- [BUG][JNI] Trigger MemoryBuffer.onClosed after memory is freed (#15351) @abellina
- Fix an issue with multiple short list rowgroups using the Parquet chunked reader. (#15342) @nvdbaranec
- Avoid importing dask-expr if "query-planning" config is `False` (#15340) @rjzamora
- Fix gtests/ERROR_TEST errors when run in Debug (#15317) @davidwendt
- Fix OOB read in `inflate_kernel` (#15309) @vuule
- Work around a cuFile error when running CSV tests with memcheck (#15293) @vuule
- Fix Doxygen upload directory (#15291) @KyleFromNVIDIA
- Fix Doxygen check (#15289) @KyleFromNVIDIA
- Reintroduce PANDAS_GE_220 import (#15287) @wence-
- Fix mean computation for the geometric distribution in the data generator (#15282) @vuule
- Fix Parquet decimal64 stats (#15281) @etseidl
- Make linking of nvtx3-cpp BUILD_LOCAL_INTERFACE (#15271) @KyleFromNVIDIA
- Workaround compute-sanitizer memcheck bug (#15259) @davidwendt
- Cleanup `hostdevice_vector` and add more APIs (#15252) @ttnghia
- Fix number of rows in randomly generated lists columns (#15248) @vuule
- Fix wrong output for `collect_list/collect_set` of lists column (#15243) @ttnghia
- Fix testchunkedPackTwoPasses to copy from the bounce buffer (#15220) @abellina
- Fix accessing `.columns` by an external API (#15212) @galipremsagar
- [JNI] Disable testChunkedPackTwoPasses for now (#15210) @abellina
- Update labeler and codeowner configs for CMake files (#15208) @PointKernel
- Avoid dict normalization in `__dask_tokenize__` (#15187) @rjzamora
- Fix memcheck error in distinct inner join (#15164) @PointKernel
- Remove unneeded script parameters in test_cpp_memcheck.sh (#15158) @davidwendt
- Fix `ListColumn.to_pandas()` to retain `list` type (#15155) @galipremsagar
- Avoid factorization in MultiIndex.to_pandas (#15150) @mroeschke
- Fix GroupBy.get_group and GroupBy.indices (#15143) @wence-
- Remove `const` from `range_window_bounds::_extent`. (#15138) @mythrocks
- DataFrame.columns = ... retains RangeIndex & set dtype (#15129) @mroeschke
- Correctly handle output for `GroupBy.apply` when chunk results are reindexed series (#15109) @brandon-b-miller
- Fix Series.groupby.shift with a MultiIndex (#15098) @mroeschke
- Fix reductions when DataFrame has MultiIndex columns (#15097) @mroeschke
- Fix deprecation warnings for deprecated hash() calls (#15095) @davidwendt
- Add support for arrow `large_string` in `cudf` (#15093) @galipremsagar
- Fix `sort_values` pytest failure with pandas-2.x regression (#15092) @galipremsagar
- Resolve path parsing issues in `get_json_object` (#15082) @SurajAralihalli
- Fix bugs in handling of delta encodings (#15075) @etseidl
- Fix `is_device_write_preferred` in `void_sink` and `user_sink_wrapper` (#15064) @vuule
- Eliminate duplicate allocation of nested string columns (#15061) @vuule
- Raise an error on import for unsupported GPUs. (#15053) @bdice
- Align concat Series name behavior in pandas 2.2 (#15032) @mroeschke
- Fix `Index.difference` to handle duplicate values when one of the inputs is empty (#15016) @galipremsagar
- Add `future_stack` to `DataFrame.stack` (#15015) @galipremsagar
- Fix handling of values=None in pylibcudf GroupBy.get_groups (#14998) @shwina
- Fix `DataFrame.sort_index` to respect `ignore_index` on all axis (#14995) @galipremsagar
- Raise for pyarrow array that is tz-aware (#14980) @mroeschke
- Direct `SeriesGroupBy.aggregate` to `SeriesGroupBy.agg` (#14971) @rjzamora
- Respect IntervalDtype and CategoricalDtype objects passed by users (#14961) @mroeschke
- unset `CUDF_SPILL` after a pytest (#14958) @galipremsagar
- Fix Null literals to be not parsed as string when mixed types as string is enabled in JSON reader (#14939) @karthikeyann
- Fix chunked reads of Parquet delta encoded pages (#14921) @etseidl
- Fix reading offset for data stream in ORC reader (#14911) @ttnghia
- Enable sanitizer check for a test case testORCReadAndWriteForDecimal128 (#14897) @res-life
- Fix dask token normalization (#14829) @rjzamora
- Fix 24.04 versions (#14825) @raydouglass
- Ensure slow private attrs are maybe proxies (#14380) @mroeschke
📖 Documentation
- Ignore DLManagedTensor in the docs build (#15392) @davidwendt
- Revert "Temporarily disable docs errors. (#15265)" (#15269) @bdice
- Temporarily disable docs errors. (#15265) @bdice
- Update `developer_guide.md` with new guidance on quoted internal includes (#15238) @harrism
- Fix broken link for developer guide (#15025) @sanjana098
- [DOC] Update typo in docs example of structs_column_wrapper (#14949) @karthikeyann
- Update cudf.pandas FAQ. (#14940) @bdice
- Optimize doc builds (#14856) @vyasr
- Add developer guideline to use east const. (#14836) @bdice
- Document how cuDF is pronounced (#14753) @pentschev
- Notes convert to Pandas-compat (#12641) @Touutae-lab
🚀 New Features
- Address inconsistency in single quote normalization in JSON reader (#15324) @shrshi
- Use JNI pinned pool resource with cuIO (#15255) @abellina
- Add DELTA_BYTE_ARRAY encoder for Parquet (#15239) @etseidl
- Migrate filling operations to pylibcudf (#15225) @brandon-b-miller
- [JNI] rmm based pinned pool (#15219) @abellina
- Implement zero-copy host buffer source instead of using an arrow implementation (#15189) @vuule
- Enable creation of columns from scalar (#15181) @vyasr
- Use NVTX from GitHub. (#15178) @bdice
- Implement `segmented_row_bit_count` for computing row sizes by segments of rows (#15169) @ttnghia
- Implement search using pylibcudf (#15166) @vyasr
- Add distinct left join (#15149) @PointKernel
- Add cardinality control for groupby benchs with flat types (#15134) @PointKernel
- Add ability to request Parquet encodings on a per-column basis (#15081) @etseidl
- Automate include grouping order in .clang-format (#15063) @harrism
- Requesting a clean build directory also clears Jitify cache (#15052) @robertmaynard
- API for JSON unquoted whitespace normalization (#15033) @shrshi
- Implement concatenate, lists.explode, merge, sorting, and stream compaction in pylibcudf (#15011) @vyasr
- Implement replace in pylibcudf (#15005) @vyasr
- Add distinct key inner join (#14990) @PointKernel
- Implement rolling in pylibcudf (#14982) @vyasr
- Implement joins in pylibcudf (#14972) @vyasr
- Implement scans and reductions in pylibcudf (#14970) @vyasr
- Rewrite cudf internals using pylibcudf groupby (#14946) @vyasr
- Implement groupby in pylibcudf (#14945) @vyasr
- Support casting of Map type to string in JSON reader (#14936) @karthikeyann
- POC for whitespace removal in input JSON data using FST (#14931) @shrshi
- Support for LZ4 compression in ORC and Parquet (#14906) @vuule
- Remove supports_streams from cuDF custom memory resources. (#14857) @harrism
- Migrate unary operations to pylibcudf (#14850) @vyasr
- Migrate binary operations to pylibcudf (#14821) @vyasr
- Add row index and stripe size options to Python ORC chunked writer (#14785) @vuule
- Support CUDA 12.2 (#14712) @jameslamb
🛠️ Improvements
- Backport: Relax protobuf lower bound to 3.20. (#15506) (#15610) @bdice
- Use `conda env create --yes` instead of `--force` (#15403) @bdice
- Restructure pylibcudf/arrow interop facilities (#15325) @vyasr
- Change exceptions thrown by copying APIs (#15319) @vyasr
- Enable branch testing for `cudf.pandas` (#15316) @galipremsagar
- Replace black with ruff-format (#15312) @mroeschke
- This fixes an NPE when trying to read empty JSON data by adding a new API for missing information (#15307) @revans2
- Address poor performance of Parquet string decoding (#15304) @etseidl
- Update script input name (#15301) @AyodeAwe
- Make test_read_parquet_partitioned_filtered data deterministic (#15296) @mroeschke
- Add timeout for `cudf.pandas` pandas tests (#15284) @galipremsagar
- Add upper bound to prevent usage of NumPy 2 (#15283) @bdice
- Fix cudf::test::to_host return of host_vector (#15263) @davidwendt
- Implement grouped product scan (#15254) @wence-
- Add CUDA 12.4 to supported PTX versions (#15247) @brandon-b-miller
- Implement DataFrame|Series.squeeze (#15244) @mroeschke
- Roll back ipow changes due to register pressure. (#15242) @pmattione-nvidia
- Remove create_chars_child_column utility (#15241) @davidwendt
- Update dlpack to version 0.8 (#15237) @dantegd
- Improve performance in JSON reader when `mixed_types_as_string` option is enabled (#15236) @shrshi
- Remove row conversion code from libcudf (#15234) @ttnghia
- Use variable substitution for RAPIDS version in Doxyfile (#15231) @KyleFromNVIDIA
- Add ListColumns.to_pandas(arrow_type=) (#15228) @mroeschke
- Treat dask-cudf CI artifacts as pure wheels (#15223) @bdice
- Clean up usage of __CUDA_ARCH__ and other macros. (#15218) @bdice
- DOC: use constants in performance-comparisons.ipynb (#15215) @raybellwaves
- Rewrite conversion in terms of column (#15213) @vyasr
- Switch `pytest-xdist` algo to `worksteal` (#15207) @galipremsagar
- Deprecate strings_column_view::offsets_begin() (#15205) @davidwendt
- Add `get_upstream_resource` method to `stream_checking_resource_adaptor` (#15203) @miscco
- Tune up row size estimation in the data generator (#15202) @vuule
- Fix `offset` value for generating test data in `parquet_chunked_reader_test.cu` (#15200) @ttnghia
- Change strings_column_view::char_size to return int64 (#15197) @davidwendt
- Fix includes for row_operators.cuh (#15194) @davidwendt
- Generalize GHA selectors for pure Python testing (#15191) @bdice
- Improvements for `__cuda_array_interface__` tests (#15188) @bdice
- Allow to_pandas to return pandas.ArrowDtype (#15182) @mroeschke
- Ignore `byte_range` in `read_json` when the size is not smaller than the input data (#15180) @vuule
- Expose new stable_sort and finish stream_compaction in pylibcudf (#15175) @wence-
- [ci] update matrix filters for dask-cudf builds (#15174) @jameslamb
- Change make_strings_children to return uvector (#15171) @davidwendt
- Don't override to_pandas for Datelike columns (#15167) @mroeschke
- Drop python-snappy from dependencies. (#15161) @bdice
- Add microkernels for fixed-width and fixed-width dictionary in Parquet decode (#15159) @abellina
- Make HostColumnVector.DataType accessor methods public (#15157) @jbrennan333
- Java bindings for left outer distinct join (#15154) @jlowe
- Forward-merge branch-24.02 to branch-24.04 (#15153) @bdice
- Enable pandas pytests for `cudf.pandas` (#15147) @galipremsagar
- Add java option to keep quotes for JSON reads (#15146) @revans2
- Change cross-pandas-version testing in `cudf` (#15145) @galipremsagar
- Use `hostdevice_vector` in `kernel_error` to avoid the pageable copy (#15140) @vuule
- Clean up Columns.astype & cudf.dtype (#15125) @mroeschke
- Simplify some to_pandas implementations (#15123) @mroeschke
- Java: Add leak tracking for Scalar instances (#15121) @jlowe
- Remove calls to strings_column_view::offsets_begin() (#15112) @davidwendt
- Add support for Python 3.11, require NumPy 1.23+ (#15111) @jameslamb
- Compile-time ipow computation with array lookup (#15110) @pmattione-nvidia
- Upgrade to `arrow-14.0.2` (#15108) @galipremsagar
- Dynamically set version in RAPIDS doc builds (#15101) @jakirkham
- Add support for `pandas-2.2` in `cudf` (#15100) @galipremsagar
- Update devcontainers to CUDA Toolkit 12.2 (#15099) @trxcllnt
- Fix `datetime` binop pytest failures in pandas-2.2 (#15090) @galipremsagar
- Validate types in pylibcudf Column/Table constructors (#15088) @wence-
- xfail test_join_ordering_pandas_compat for pandas 2.2 (#15080) @mroeschke
- Add general purpose host memory allocator reference to cuIO with a demo of pooled-pinned allocation. (#15079) @nvdbaranec
- Adjust test_binops for pandas 2.2 (#15078) @mroeschke
- Remove offsets_begin() call from nvtext::generate_ngrams (#15077) @davidwendt
- Use offsetalator in cudf::detail::has_nonempty_null_rows (#15076) @davidwendt
- Deprecate cudf::hashing::spark_murmurhash3_x86_32 (#15074) @davidwendt
- Fix cudf::test::to_host to handle both offset types for strings columns (#15073) @davidwendt
- Add condition for test_groupby_nulls_basic in pandas 2.2 (#15072) @mroeschke
- xfail tests in test_udf_masked_ops due to pandas 2.2 bug (#15071) @mroeschke
- target branch-24.04 for GitHub Actions workflows (#15069) @jameslamb
- Implement stable version of `cudf::sort` (#15066) @wence-
- Fix ORC and JSON tests failures for pandas 2.2 (#15062) @mroeschke
- Adjust test_joining for pandas 2.2 (#15060) @mroeschke
- Align MultiIndex.get_indexer with pandas 2.2 change (#15059) @mroeschke
- Fix test_resample index dtype checking for pandas 2.2 (#15058) @mroeschke
- Split out strings/replace.cu and rework its gtests (#15054) @davidwendt
- Avoid incompatible value type setting in test_rolling for pandas 2.2 (#15050) @mroeschke
- Change chained replace inplace test to COW test for pandas 2.2 (#15049) @mroeschke
- Deprecate datelike isin casting strings to dates to match pandas 2.2 (#15046) @mroeschke
- Avoid chained indexing in test_indexing for pandas 2.2 (#15045) @mroeschke
- Avoid pandas 2.2 `DeprecationWarning` in test_hdf (#15044) @mroeschke
- Use appropriate make_offsets_child_column for building lists columns (#15043) @davidwendt
- Factor out position-offsets logic from strings split_helper utility (#15040) @davidwendt
- Forward-merge branch-24.02 to branch-24.04 (#15039) @bdice
- Clean up nvtx macros (#15038) @PointKernel
- Add xfailures for test_applymap for pandas 2.2 (#15034) @mroeschke
- Expose libcudf filter expression in read_parquet (#15028) @wence-
- Adjust tests in test_dataframe.py for pandas 2.2 (#15023) @mroeschke
- Adjust test_datetime_infer_format for pandas 2.2 (#15021) @mroeschke
- Performance optimizations for parquet sub-rowgroup reader. (#15020) @nvdbaranec
- JNI bindings for distinct_hash_join (#15019) @jlowe
- Change copy_if_safe to call thrust instead of the overload function (#15018) @davidwendt
- Improve performance of copy_if_else for long strings (#15017) @davidwendt
- Fix is_string_dtype test for pandas 2.2 (#15012) @mroeschke
- Rework cudf::strings::detail::copy_range for offsetalator (#15010) @davidwendt
- Use offsetalator in cudf::get_json_object() (#15009) @davidwendt
- Align integral types in ORC to specs (#15008) @vuule
- Clean up detail sequence header inclusion (#15007) @PointKernel
- Add groupby.apply(include_groups=) to match pandas 2.2 deprecation (#15006) @mroeschke
- Use offsetalator in cudf::interleave_columns() (#15004) @davidwendt
- Use offsetalator in cudf::row_bit_count() (#15003) @davidwendt
- Use offsetalator in cudf::strings::wrap() (#15002) @davidwendt
- Use offsetalator in cudf::strings::reverse (#15001) @davidwendt
- Deprecate groupby fillna (#15000) @mroeschke
- Ensure to_* IO methods respect pandas 2.2 keyword only deprecation (#14999) @mroeschke
- Remove unneeded calls to create_chars_child_column utility (#14997) @davidwendt
- Add environment-agnostic scripts for running ctests and pytests (#14992) @trxcllnt
- Filter all `DeprecationWarning`'s by `ArrowTable.to_pandas()` (#14989) @galipremsagar
- Deprecate replace with categorical columns (#14988) @mroeschke
- Deprecate delim_whitespace in read_csv for pandas 2.2 (#14986) @mroeschke
- Deprecate parameters similar to pandas 2.2 (#14984) @mroeschke
- Ensure that `ctest` is called with `--no-tests=error`. (#14983) @bdice
- Deprecate non-integer `periods` in `date_range` and `interval_range` (#14976) @galipremsagar
- Update ops-bot.yaml (#14974) @AyodeAwe
- Use page statistics in Parquet reader (#14973) @etseidl
- Use fused types for overloaded function signatures (#14969) @vyasr
- Deprecate certain frequency strings (#14967) @galipremsagar
- Update copyrights for 24.04. (#14964) @bdice
- Add missing atomic operators, refactor atomic operators, move atomic operators to detail namespace. (#14962) @bdice
- Introduce `GetJsonObjectOptions` in `getJSONObject` Java API (#14956) @SurajAralihalli
- JNI JSON read with DataSource and inferred schema, along with basic java nested Schema JSON reads (#14954) @revans2
- Make codecov only informational (always pass). (#14952) @bdice
- Replace legacy cudf and dask_cudf imports as (d)gd (#14944) @mroeschke
- Replace isdatetime64tz/interval_dtype with isinstance (#14943) @mroeschke
- Update tests for pandas 2. (#14941) @bdice
- Use more public pandas APIs (#14929) @mroeschke
- Replace local copyright check with pre-commit-hooks verify-copyright (#14917) @KyleFromNVIDIA
- Add `pandas-2.x` support in `cudf` (#14916) @galipremsagar
- Use offsetalator in nvtext::byte_pair_encoding (#14888) @davidwendt
- De-DOS line-endings (#14880) @wence-
- Add detail `cuco_allocator` (#14877) @PointKernel
- Move all core types to using enum class in Cython (#14876) @vyasr
- Read `cudf.__version__` in Sphinx build (#14872) @KyleFromNVIDIA
- Use int64 offset types for accessing code-points in nvtext::normalize (#14868) @davidwendt
- Read version from VERSION file in CMake (#14867) @KyleFromNVIDIA
- Update conda-cpp-post-build-checks to branch-24.04. (#14854) @bdice
- Update cudf for compatibility with the latest cuco (#14849) @PointKernel
- Remove deprecated strings functions (#14848) @davidwendt
- Fix CI workflows for pandas-tests and add test summary. (#14847) @bdice
- Use offsetalator in cudf::strings::copy_slice (#14844) @davidwendt
- Fix V2 Parquet page alignment for use with zStandard compression (#14841) @etseidl
- Fix calls to deprecated strings factory API in examples. (#14838) @bdice
- Update pre-commit hooks (#14837) @bdice
- Use `rapids_cuda_set_runtime` to determine cuda runtime usage by target (#14833) @vyasr
- Remove get_mem_info functions from custom memory resources (#14832) @harrism
- Fix debug build by splitting row_operator_tests_utilities.cu (#14826) @davidwendt
- Remove -DNVBench_ENABLE_CUPTI=OFF. (#14820) @bdice
- Use cuco::static_set in the hash-based groupby (#14813) @PointKernel
- Branch 24.04 merge branch 24.02 (#14809) @vyasr
- Branch 24.04 merge branch 24.02 (#14806) @vyasr
- Introduce basic "cudf" backend for Dask Expressions (#14805) @rjzamora
- Remove `build_struct|list_column` (#14786) @mroeschke
- Use offsetalator in nvtext tokenize functions (#14783) @davidwendt
- Reduce execution time of Python ORC tests (#14776) @vuule
- Use offsetalator in cudf::strings::split functions (#14757) @davidwendt
- Use offsetalator in cudf::strings::findall (#14745) @davidwendt
- Use offsetalator in cudf::strings::url_decode (#14744) @davidwendt
- Use get_offset_value utility in strings shift function (#14743) @davidwendt
- Use as_column instead of full (#14698) @mroeschke
- List all notable breaking changes (#13535) @galipremsagar
Published by raydouglass almost 2 years ago
https://github.com/rapidsai/cudf - v24.04.00
🚨 Breaking Changes
- Restructure pylibcudf/arrow interop facilities (#15325) @vyasr
- Change exceptions thrown by copying APIs (#15319) @vyasr
- Change strings_column_view::char_size to return int64 (#15197) @davidwendt
- Upgrade to `arrow-14.0.2` (#15108) @galipremsagar
- Add support for `pandas-2.2` in `cudf` (#15100) @galipremsagar
- Deprecate cudf::hashing::spark_murmurhash3_x86_32 (#15074) @davidwendt
- Align MultiIndex.get_indexer with pandas 2.2 change (#15059) @mroeschke
- Raise an error on import for unsupported GPUs. (#15053) @bdice
- Deprecate datelike isin casting strings to dates to match pandas 2.2 (#15046) @mroeschke
- Align concat Series name behavior in pandas 2.2 (#15032) @mroeschke
- Add `future_stack` to `DataFrame.stack` (#15015) @galipremsagar
- Deprecate groupby fillna (#15000) @mroeschke
- Deprecate replace with categorical columns (#14988) @mroeschke
- Deprecate delim_whitespace in read_csv for pandas 2.2 (#14986) @mroeschke
- Deprecate parameters similar to pandas 2.2 (#14984) @mroeschke
- Add missing atomic operators, refactor atomic operators, move atomic operators to detail namespace. (#14962) @bdice
- Add `pandas-2.x` support in `cudf` (#14916) @galipremsagar
- Use cuco::static_set in the hash-based groupby (#14813) @PointKernel
🐛 Bug Fixes
- Fix an issue with creating a series from scalar when `dtype='category'` (#15476) @galipremsagar
- Update pre-commit-hooks to v0.0.3 (#15355) @KyleFromNVIDIA
- [BUG][JNI] Trigger MemoryBuffer.onClosed after memory is freed (#15351) @abellina
- Fix an issue with multiple short list rowgroups using the Parquet chunked reader. (#15342) @nvdbaranec
- Avoid importing dask-expr if "query-planning" config is `False` (#15340) @rjzamora
- Fix gtests/ERROR_TEST errors when run in Debug (#15317) @davidwendt
- Fix OOB read in `inflate_kernel` (#15309) @vuule
- Work around a cuFile error when running CSV tests with memcheck (#15293) @vuule
- Fix Doxygen upload directory (#15291) @KyleFromNVIDIA
- Fix Doxygen check (#15289) @KyleFromNVIDIA
- Reintroduce PANDAS_GE_220 import (#15287) @wence-
- Fix mean computation for the geometric distribution in the data generator (#15282) @vuule
- Fix Parquet decimal64 stats (#15281) @etseidl
- Make linking of nvtx3-cpp BUILD_LOCAL_INTERFACE (#15271) @KyleFromNVIDIA
- Workaround compute-sanitizer memcheck bug (#15259) @davidwendt
- Cleanup `hostdevice_vector` and add more APIs (#15252) @ttnghia
- Fix number of rows in randomly generated lists columns (#15248) @vuule
- Fix wrong output for `collect_list`/`collect_set` of lists column (#15243) @ttnghia
- Fix testchunkedPackTwoPasses to copy from the bounce buffer (#15220) @abellina
- Fix accessing `.columns` by an external API (#15212) @galipremsagar
- [JNI] Disable testChunkedPackTwoPasses for now (#15210) @abellina
- Update labeler and codeowner configs for CMake files (#15208) @PointKernel
- Avoid dict normalization in `__dask_tokenize__` (#15187) @rjzamora
- Fix memcheck error in distinct inner join (#15164) @PointKernel
- Remove unneeded script parameters in test_cpp_memcheck.sh (#15158) @davidwendt
- Fix `ListColumn.to_pandas()` to retain `list` type (#15155) @galipremsagar
- Avoid factorization in MultiIndex.to_pandas (#15150) @mroeschke
- Fix GroupBy.get_group and GroupBy.indices (#15143) @wence-
- Remove `const` from `range_window_bounds::_extent`. (#15138) @mythrocks
- DataFrame.columns = ... retains RangeIndex & set dtype (#15129) @mroeschke
- Correctly handle output for `GroupBy.apply` when chunk results are reindexed series (#15109) @brandon-b-miller
- Fix Series.groupby.shift with a MultiIndex (#15098) @mroeschke
- Fix reductions when DataFrame has MultiIndex columns (#15097) @mroeschke
- Fix deprecation warnings for deprecated hash() calls (#15095) @davidwendt
- Add support for arrow `large_string` in `cudf` (#15093) @galipremsagar
- Fix `sort_values` pytest failure with pandas-2.x regression (#15092) @galipremsagar
- Resolve path parsing issues in `get_json_object` (#15082) @SurajAralihalli
- Fix bugs in handling of delta encodings (#15075) @etseidl
- Fix `is_device_write_preferred` in `void_sink` and `user_sink_wrapper` (#15064) @vuule
- Eliminate duplicate allocation of nested string columns (#15061) @vuule
- Raise an error on import for unsupported GPUs. (#15053) @bdice
- Align concat Series name behavior in pandas 2.2 (#15032) @mroeschke
- Fix `Index.difference` to handle duplicate values when one of the inputs is empty (#15016) @galipremsagar
- Add `future_stack` to `DataFrame.stack` (#15015) @galipremsagar
- Fix handling of values=None in pylibcudf GroupBy.get_groups (#14998) @shwina
- Fix `DataFrame.sort_index` to respect `ignore_index` on all axis (#14995) @galipremsagar
- Raise for pyarrow array that is tz-aware (#14980) @mroeschke
- Direct `SeriesGroupBy.aggregate` to `SeriesGroupBy.agg` (#14971) @rjzamora
- Respect IntervalDtype and CategoricalDtype objects passed by users (#14961) @mroeschke
- unset `CUDF_SPILL` after a pytest (#14958) @galipremsagar
- Fix Null literals to be not parsed as string when mixed types as string is enabled in JSON reader (#14939) @karthikeyann
- Fix chunked reads of Parquet delta encoded pages (#14921) @etseidl
- Fix reading offset for data stream in ORC reader (#14911) @ttnghia
- Enable sanitizer check for a test case testORCReadAndWriteForDecimal128 (#14897) @res-life
- Fix dask token normalization (#14829) @rjzamora
- Fix 24.04 versions (#14825) @raydouglass
- Ensure slow private attrs are maybe proxies (#14380) @mroeschke
📖 Documentation
- Ignore DLManagedTensor in the docs build (#15392) @davidwendt
- Revert "Temporarily disable docs errors. (#15265)" (#15269) @bdice
- Temporarily disable docs errors. (#15265) @bdice
- Update `developer_guide.md` with new guidance on quoted internal includes (#15238) @harrism
- Fix broken link for developer guide (#15025) @sanjana098
- [DOC] Update typo in docs example of structs_column_wrapper (#14949) @karthikeyann
- Update cudf.pandas FAQ. (#14940) @bdice
- Optimize doc builds (#14856) @vyasr
- Add developer guideline to use east const. (#14836) @bdice
- Document how cuDF is pronounced (#14753) @pentschev
- Notes convert to Pandas-compat (#12641) @Touutae-lab
🚀 New Features
- Address inconsistency in single quote normalization in JSON reader (#15324) @shrshi
- Use JNI pinned pool resource with cuIO (#15255) @abellina
- Add DELTA_BYTE_ARRAY encoder for Parquet (#15239) @etseidl
- Migrate filling operations to pylibcudf (#15225) @brandon-b-miller
- [JNI] rmm based pinned pool (#15219) @abellina
- Implement zero-copy host buffer source instead of using an arrow implementation (#15189) @vuule
- Enable creation of columns from scalar (#15181) @vyasr
- Use NVTX from GitHub. (#15178) @bdice
- Implement `segmented_row_bit_count` for computing row sizes by segments of rows (#15169) @ttnghia
- Implement search using pylibcudf (#15166) @vyasr
- Add distinct left join (#15149) @PointKernel
- Add cardinality control for groupby benchs with flat types (#15134) @PointKernel
- Add ability to request Parquet encodings on a per-column basis (#15081) @etseidl
- Automate include grouping order in .clang-format (#15063) @harrism
- Requesting a clean build directory also clears Jitify cache (#15052) @robertmaynard
- API for JSON unquoted whitespace normalization (#15033) @shrshi
- Implement concatenate, lists.explode, merge, sorting, and stream compaction in pylibcudf (#15011) @vyasr
- Implement replace in pylibcudf (#15005) @vyasr
- Add distinct key inner join (#14990) @PointKernel
- Implement rolling in pylibcudf (#14982) @vyasr
- Implement joins in pylibcudf (#14972) @vyasr
- Implement scans and reductions in pylibcudf (#14970) @vyasr
- Rewrite cudf internals using pylibcudf groupby (#14946) @vyasr
- Implement groupby in pylibcudf (#14945) @vyasr
- Support casting of Map type to string in JSON reader (#14936) @karthikeyann
- POC for whitespace removal in input JSON data using FST (#14931) @shrshi
- Support for LZ4 compression in ORC and Parquet (#14906) @vuule
- Remove supports_streams from cuDF custom memory resources. (#14857) @harrism
- Migrate unary operations to pylibcudf (#14850) @vyasr
- Migrate binary operations to pylibcudf (#14821) @vyasr
- Add row index and stripe size options to Python ORC chunked writer (#14785) @vuule
- Support CUDA 12.2 (#14712) @jameslamb
🛠️ Improvements
- Use `conda env create --yes` instead of `--force` (#15403) @bdice
- Restructure pylibcudf/arrow interop facilities (#15325) @vyasr
- Change exceptions thrown by copying APIs (#15319) @vyasr
- Enable branch testing for `cudf.pandas` (#15316) @galipremsagar
- Replace black with ruff-format (#15312) @mroeschke
- This fixes an NPE when trying to read empty JSON data by adding a new API for missing information (#15307) @revans2
- Address poor performance of Parquet string decoding (#15304) @etseidl
- Update script input name (#15301) @AyodeAwe
- Make testreadparquetpartitionedfiltered data deterministic (#15296) @mroeschke
- Add timeout for `cudf.pandas` pandas tests (#15284) @galipremsagar
- Add upper bound to prevent usage of NumPy 2 (#15283) @bdice
- Fix cudf::test::to_host return of host_vector (#15263) @davidwendt
- Implement grouped product scan (#15254) @wence-
- Add CUDA 12.4 to supported PTX versions (#15247) @brandon-b-miller
- Implement DataFrame|Series.squeeze (#15244) @mroeschke
- Roll back ipow changes due to register pressure. (#15242) @pmattione-nvidia
- Remove create_chars_child_column utility (#15241) @davidwendt
- Update dlpack to version 0.8 (#15237) @dantegd
- Improve performance in JSON reader when `mixed_types_as_string` option is enabled (#15236) @shrshi
- Remove row conversion code from libcudf (#15234) @ttnghia
- Use variable substitution for RAPIDS version in Doxyfile (#15231) @KyleFromNVIDIA
- Add ListColumns.to_pandas(arrow_type=) (#15228) @mroeschke
- Treat dask-cudf CI artifacts as pure wheels (#15223) @bdice
- Clean up usage of __CUDA_ARCH__ and other macros. (#15218) @bdice
- DOC: use constants in performance-comparisons.ipynb (#15215) @raybellwaves
- Rewrite conversion in terms of column (#15213) @vyasr
- Switch `pytest-xdist` algo to `worksteal` (#15207) @galipremsagar
- Deprecate strings_column_view::offsets_begin() (#15205) @davidwendt
- Add `get_upstream_resource` method to `stream_checking_resource_adaptor` (#15203) @miscco
- Tune up row size estimation in the data generator (#15202) @vuule
- Fix `offset` value for generating test data in `parquet_chunked_reader_test.cu` (#15200) @ttnghia
- Change strings_column_view::char_size to return int64 (#15197) @davidwendt
- Fix includes for row_operators.cuh (#15194) @davidwendt
- Generalize GHA selectors for pure Python testing (#15191) @bdice
- Improvements for `__cuda_array_interface__` tests (#15188) @bdice
- Allow to_pandas to return pandas.ArrowDtype (#15182) @mroeschke
- Ignore `byte_range` in `read_json` when the size is not smaller than the input data (#15180) @vuule
- Expose new stable_sort and finish stream_compaction in pylibcudf (#15175) @wence-
- [ci] update matrix filters for dask-cudf builds (#15174) @jameslamb
- Change make_strings_children to return uvector (#15171) @davidwendt
- Don't override to_pandas for Datelike columns (#15167) @mroeschke
- Drop python-snappy from dependencies. (#15161) @bdice
- Add microkernels for fixed-width and fixed-width dictionary in Parquet decode (#15159) @abellina
- Make HostColumnVector.DataType accessor methods public (#15157) @jbrennan333
- Java bindings for left outer distinct join (#15154) @jlowe
- Forward-merge branch-24.02 to branch-24.04 (#15153) @bdice
- Enable pandas pytests for `cudf.pandas` (#15147) @galipremsagar
- Add java option to keep quotes for JSON reads (#15146) @revans2
- Change cross-pandas-version testing in `cudf` (#15145) @galipremsagar
- Use `hostdevice_vector` in `kernel_error` to avoid the pageable copy (#15140) @vuule
- Clean up Columns.astype & cudf.dtype (#15125) @mroeschke
- Simplify some to_pandas implementations (#15123) @mroeschke
- Java: Add leak tracking for Scalar instances (#15121) @jlowe
- Remove calls to strings_column_view::offsets_begin() (#15112) @davidwendt
- Add support for Python 3.11, require NumPy 1.23+ (#15111) @jameslamb
- Compile-time ipow computation with array lookup (#15110) @pmattione-nvidia
- Upgrade to `arrow-14.0.2` (#15108) @galipremsagar
- Dynamically set version in RAPIDS doc builds (#15101) @jakirkham
- Add support for `pandas-2.2` in `cudf` (#15100) @galipremsagar
- Update devcontainers to CUDA Toolkit 12.2 (#15099) @trxcllnt
- Fix `datetime` binop pytest failures in pandas-2.2 (#15090) @galipremsagar
- Validate types in pylibcudf Column/Table constructors (#15088) @wence-
- xfail testjoinorderingpandascompat for pandas 2.2 (#15080) @mroeschke
- Add general purpose host memory allocator reference to cuIO with a demo of pooled-pinned allocation. (#15079) @nvdbaranec
- Adjust test_binops for pandas 2.2 (#15078) @mroeschke
- Remove offsets_begin() call from nvtext::generate_ngrams (#15077) @davidwendt
- Use offsetalator in cudf::detail::has_nonempty_null_rows (#15076) @davidwendt
- Deprecate cudf::hashing::spark_murmurhash3_x86_32 (#15074) @davidwendt
- Fix cudf::test::to_host to handle both offset types for strings columns (#15073) @davidwendt
- Add condition for testgroupbynulls_basic in pandas 2.2 (#15072) @mroeschke
- xfail tests in testudfmasked_ops due to pandas 2.2 bug (#15071) @mroeschke
- target branch-24.04 for GitHub Actions workflows (#15069) @jameslamb
- Implement stable version of `cudf::sort` (#15066) @wence-
- Fix ORC and JSON tests failures for pandas 2.2 (#15062) @mroeschke
- Adjust test_joining for pandas 2.2 (#15060) @mroeschke
- Align MultiIndex.get_indexer with pandas 2.2 change (#15059) @mroeschke
- Fix test_resample index dtype checking for pandas 2.2 (#15058) @mroeschke
- Split out strings/replace.cu and rework its gtests (#15054) @davidwendt
- Avoid incompatible value type setting in test_rolling for pandas 2.2 (#15050) @mroeschke
- Change chained replace inplace test to COW test for pandas 2.2 (#15049) @mroeschke
- Deprecate datelike isin casting strings to dates to match pandas 2.2 (#15046) @mroeschke
- Avoid chained indexing in test_indexing for pandas 2.2 (#15045) @mroeschke
- Avoid pandas 2.2 `DeprecationWarning` in test_hdf (#15044) @mroeschke
- Use appropriate make_offsets_child_column for building lists columns (#15043) @davidwendt
- Factor out position-offsets logic from strings split_helper utility (#15040) @davidwendt
- Forward-merge branch-24.02 to branch-24.04 (#15039) @bdice
- Clean up nvtx macros (#15038) @PointKernel
- Add xfailures for test_applymap for pandas 2.2 (#15034) @mroeschke
- Expose libcudf filter expression in read_parquet (#15028) @wence-
- Adjust tests in test_dataframe.py for pandas 2.2 (#15023) @mroeschke
- Adjust testdatetimeinfer_format for pandas 2.2 (#15021) @mroeschke
- Performance optimizations for parquet sub-rowgroup reader. (#15020) @nvdbaranec
- JNI bindings for distinct_hash_join (#15019) @jlowe
- Change copy_if_safe to call thrust instead of the overload function (#15018) @davidwendt
- Improve performance of copy_if_else for long strings (#15017) @davidwendt
- Fix is_string_dtype test for pandas 2.2 (#15012) @mroeschke
- Rework cudf::strings::detail::copy_range for offsetalator (#15010) @davidwendt
- Use offsetalator in cudf::get_json_object() (#15009) @davidwendt
- Align integral types in ORC to specs (#15008) @vuule
- Clean up detail sequence header inclusion (#15007) @PointKernel
- Add groupby.apply(include_groups=) to match pandas 2.2 deprecation (#15006) @mroeschke
- Use offsetalator in cudf::interleave_columns() (#15004) @davidwendt
- Use offsetalator in cudf::row_bit_count() (#15003) @davidwendt
- Use offsetalator in cudf::strings::wrap() (#15002) @davidwendt
- Use offsetalator in cudf::strings::reverse (#15001) @davidwendt
- Deprecate groupby fillna (#15000) @mroeschke
- Ensure to_* IO methods respect pandas 2.2 keyword only deprecation (#14999) @mroeschke
- Remove unneeded calls to create_chars_child_column utility (#14997) @davidwendt
- Add environment-agnostic scripts for running ctests and pytests (#14992) @trxcllnt
- Filter all `DeprecationWarning`'s by `ArrowTable.to_pandas()` (#14989) @galipremsagar
- Deprecate replace with categorical columns (#14988) @mroeschke
- Deprecate delim_whitespace in read_csv for pandas 2.2 (#14986) @mroeschke
- Deprecate parameters similar to pandas 2.2 (#14984) @mroeschke
- Ensure that `ctest` is called with `--no-tests=error`. (#14983) @bdice
- Deprecate non-integer `periods` in `date_range` and `interval_range` (#14976) @galipremsagar
- Update ops-bot.yaml (#14974) @AyodeAwe
- Use page statistics in Parquet reader (#14973) @etseidl
- Use fused types for overloaded function signatures (#14969) @vyasr
- Deprecate certain frequency strings (#14967) @galipremsagar
- Update copyrights for 24.04. (#14964) @bdice
- Add missing atomic operators, refactor atomic operators, move atomic operators to detail namespace. (#14962) @bdice
- Introduce `GetJsonObjectOptions` in `getJSONObject` Java API (#14956) @SurajAralihalli
- JNI JSON read with DataSource and inferred schema, along with basic java nested Schema JSON reads (#14954) @revans2
- Make codecov only informational (always pass). (#14952) @bdice
- Replace legacy cudf and dask_cudf imports as (d)gd (#14944) @mroeschke
- Replace isdatetime64tz/interval_dtype with isinstance (#14943) @mroeschke
- Update tests for pandas 2. (#14941) @bdice
- Use more public pandas APIs (#14929) @mroeschke
- Replace local copyright check with pre-commit-hooks verify-copyright (#14917) @KyleFromNVIDIA
- Add `pandas-2.x` support in `cudf` (#14916) @galipremsagar
- Use offsetalator in nvtext::byte_pair_encoding (#14888) @davidwendt
- De-DOS line-endings (#14880) @wence-
- Add detail `cuco_allocator` (#14877) @PointKernel
- Move all core types to using enum class in Cython (#14876) @vyasr
- Read `cudf.__version__` in Sphinx build (#14872) @KyleFromNVIDIA
- Use int64 offset types for accessing code-points in nvtext::normalize (#14868) @davidwendt
- Read version from VERSION file in CMake (#14867) @KyleFromNVIDIA
- Update conda-cpp-post-build-checks to branch-24.04. (#14854) @bdice
- Update cudf for compatibility with the latest cuco (#14849) @PointKernel
- Remove deprecated strings functions (#14848) @davidwendt
- Fix CI workflows for pandas-tests and add test summary. (#14847) @bdice
- Use offsetalator in cudf::strings::copy_slice (#14844) @davidwendt
- Fix V2 Parquet page alignment for use with zStandard compression (#14841) @etseidl
- Fix calls to deprecated strings factory API in examples. (#14838) @bdice
- Update pre-commit hooks (#14837) @bdice
- Use `rapids_cuda_set_runtime` to determine cuda runtime usage by target (#14833) @vyasr
- Remove get_mem_info functions from custom memory resources (#14832) @harrism
- Fix debug build by splitting row_operator_tests_utilities.cu (#14826) @davidwendt
- Remove -DNVBench_ENABLE_CUPTI=OFF. (#14820) @bdice
- Use cuco::static_set in the hash-based groupby (#14813) @PointKernel
- Branch 24.04 merge branch 24.02 (#14809) @vyasr
- Branch 24.04 merge branch 24.02 (#14806) @vyasr
- Introduce basic "cudf" backend for Dask Expressions (#14805) @rjzamora
- Remove `build_struct|list_column` (#14786) @mroeschke
- Use offsetalator in nvtext tokenize functions (#14783) @davidwendt
- Reduce execution time of Python ORC tests (#14776) @vuule
- Use offsetalator in cudf::strings::split functions (#14757) @davidwendt
- Use offsetalator in cudf::strings::findall (#14745) @davidwendt
- Use offsetalator in cudf::strings::url_decode (#14744) @davidwendt
- Use get_offset_value utility in strings shift function (#14743) @davidwendt
- Use as_column instead of full (#14698) @mroeschke
- List all notable breaking changes (#13535) @galipremsagar
- C++
Published by raydouglass almost 2 years ago
https://github.com/rapidsai/cudf - [NIGHTLY] v24.06.00
🔗 Links
🚨 Breaking Changes
- Remove deprecated strings offsets_begin (#15454) @davidwendt
- Floating <--> fixed-point conversion must now be called explicitly (#15438) @pmattione-nvidia
- Remove deprecated hash() and spark_murmurhash3_x86_32() (#15375) @davidwendt
- Remove empty elements from exploded character-ngrams output (#15371) @davidwendt
- [FEA] Performance improvement for mixed left semi/anti join (#15288) @tgujar
- Align date_range defaults with pandas, support tz (#15139) @mroeschke
🐛 Bug Fixes
- nanoarrow uses package override for proper pinned versions generation (#15515) @robertmaynard
- Make improvements in pandas-test reporting (#15485) @galipremsagar
- Fixed page data truncation in parquet writer under certain conditions. (#15474) @nvdbaranec
- Only use data_type constructor with scale for decimal types (#15472) @wence-
- Avoid "p2p" shuffle as a default when `dask_cudf` is imported (#15469) @rjzamora
- Fix debug build errors from to_arrow_device_test.cpp (#15463) @davidwendt
- Fix base_normalator::integer_sizeof_fn integer dispatch (#15457) @davidwendt
- Allow consumers of static builds to find nanoarrow (#15456) @robertmaynard
- Allow jit compilation when using a splayed CUDA toolkit (#15451) @robertmaynard
- Test static builds in CI and fix nanoarrow configure (#15437) @vyasr
- Fixes potential race in JSON parser when parsing JSON lines format and when recovering from invalid lines (#15419) @elstehle
- Fix errors in chunked ORC writer when no tables were (successfully) written (#15393) @vuule
- Support implicit array conversion with query-planning enabled (#15378) @rjzamora
- Fix arrow-based round trip of empty dataframes (#15373) @wence-
- Remove empty elements from exploded character-ngrams output (#15371) @davidwendt
- Remove boundscheck=False setting in cython files (#15362) @wence-
- Patch dask-expr `var` logic in dask-cudf (#15347) @rjzamora
- Fix for logical and syntactical errors in libcudf c++ examples (#15346) @mhaseeb123
- Disable dask-expr in docs builds. (#15343) @bdice
- Apply the cuFile error work around to data_sink as well (#15335) @vuule
📖 Documentation
- Add debug tips section to libcudf developer guide (#15329) @davidwendt
🚀 New Features
- Introduce benchmark suite for JSON reader options (#15124) @shrshi
- Add `to_arrow_device` function to cudf interop using nanoarrow (#15047) @zeroshade
🛠️ Improvements
- Enable tests/scalar and test/series in cudf.pandas tests (#15486) @mroeschke
- Avoid .ordered and .categories from being settable in CategoricalColumn and CategoricalDtype (#15475) @mroeschke
- Ignore pandas tests for cudf.pandas that need motoserver (#15468) @mroeschke
- Use cachedproperty for NumericColumn.nancount instead of .nancount variable (#15466) @mroeschke
- Add custom status check workflow (#15464) @galipremsagar
- Enable tests/strings/test_api.py and tests/io/pytables in cudf.pandas tests (#15461) @mroeschke
- Remove deprecated strings offsets_begin (#15454) @davidwendt
- Enable tests/windows/ in cudf.pandas tests (#15444) @mroeschke
- Enable tests/interchange/test_impl.py in cudf.pandas tests (#15443) @mroeschke
- Enable tests/io/test_user_agent.py in cudf pandas tests (#15442) @mroeschke
- Remove prior test skipping in run-pandas-tests with testing 2.2.1 (#15440) @mroeschke
- Support orc and text IO with dask-expr using legacy conversion (#15439) @rjzamora
- Floating <--> fixed-point conversion must now be called explicitly (#15438) @pmattione-nvidia
- Enable `dask_cudf` json and s3 tests with query-planning on (#15408) @rjzamora
- Bump ruff and codespell pre-commit checks (#15407) @mroeschke
- Enable all tests for `arm` arch (#15402) @galipremsagar
- Remove deprecated hash() and spark_murmurhash3_x86_32() (#15375) @davidwendt
- Enable test-reporting for pandas pytests in CI (#15369) @galipremsagar
- Use logical types in Parquet reader (#15365) @etseidl
- Forward-merge branch-24.04 to branch-24.06 (#15349) @bdice
- Use ruff pydocstyle over pydocstyle pre-commit hook (#15345) @mroeschke
- Refactor stream mode setup for gtests (#15337) @davidwendt
- Avoid duplicate dask-cudf testing (#15333) @rjzamora
- Update udf_cpp to use rapids_cpm_cccl. (#15331) @bdice
- Forward-merge branch-24.04 into branch-24.06 skip ci @rapids-bot[bot]
- Allow `numeric_only=True` for simple groupby reductions (#15326) @rjzamora
- Drop CentOS 7 support. (#15323) @bdice
- Rework cudf::find_and_replace_all to use gather-based make_strings_column (#15305) @davidwendt
- First pass at adding testing for pylibcudf (#15300) @vyasr
- [FEA] Performance improvement for mixed left semi/anti join (#15288) @tgujar
- Rework cudf::replace_nulls to use strings::detail::copy_if_else (#15286) @davidwendt
- Large strings support in cudf::concatenate (#15195) @davidwendt
- Use less is_categorical_dtype (#15148) @mroeschke
- Align date_range defaults with pandas, support tz (#15139) @mroeschke
- `ModuleAccelerator` performance: cache the result of checking if a caller is in the denylist (#15056) @shwina
- Use offsetalator in cudf::strings::replace functions (#14824) @davidwendt
- Cleanup some timedelta/datetime column logic (#14715) @mroeschke
- Refactor numpy array input in as_column (#14651) @mroeschke
- C++
Published by rapids-bot[bot] almost 2 years ago
https://github.com/rapidsai/cudf - v24.02.02
🚨 Breaking Changes
- Remove **kwargs from astype (#14765) @mroeschke
- Remove mimesis as a testing dependency (#14723) @mroeschke
- Update to Dask's `shuffle_method` kwarg (#14708) @pentschev
- Drop Pascal GPU support. (#14630) @bdice
- Update to CCCL 2.2.0. (#14576) @bdice
- Expunge as_frame conversions in Column algorithms (#14491) @wence-
- Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
- Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
- Include writer code and writerVersion in ORC files (#14458) @vuule
- Remove null mask for zero nulls in json readers (#14451) @karthikeyann
- REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
- Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
- Move chars column to parent data buffer in strings column (#14202) @karthikeyann
- Switch to scikit-build-core (#13531) @vyasr
🐛 Bug Fixes
- Bump to nvcomp 3.0.6. (#15128) @bdice
- [HOTFIX] Unpin numba<0.58 (#15031) @raydouglass
- Exclude tests from builds (#14981) @vyasr
- Fix the bounce buffer size in ORC writer (#14947) @vuule
- Revert sum/product aggregation to always produce `int64_t` type (#14907) @SurajAralihalli
- Fixed an issue with output chunking computation stemming from input chunking. (#14889) @nvdbaranec
- Fix total_byte_size in Parquet row group metadata (#14802) @etseidl
- Fix index difference to follow the pandas format (#14789) @amiralimi
- Fix shared-workflows repo name (#14784) @raydouglass
- Remove unparseable attributes from all nodes (#14780) @vyasr
- Refactor and add validation to IntervalIndex.__init__ (#14778) @mroeschke
- Work around incompatibilities between V2 page header handling and zStandard compression in Parquet writer (#14772) @etseidl
- Fix calls to deprecated strings factory API (#14771) @davidwendt
- Fix ptx file discovery in editable installs (#14767) @vyasr
- Revise `shuffle` deprecation to align with dask/dask (#14762) @rjzamora
- Enable intermediate proxies to be picklable (#14752) @shwina
- Add CUDF_TEST_PROGRAM_MAIN macro to tests lacking it (#14751) @etseidl
- Fix CMake args (#14746) @vyasr
- Fix logic bug introduced in #14730 (#14742) @wence-
- [Java] Choose The Correct RoundingMode For Checking Decimal OutOfBounds (#14731) @razajafri
- Fix `Groupby.get_group` (#14728) @rjzamora
- Ensure that all CUDA kernels in cudf have hidden visibility. (#14726) @robertmaynard
- Split cuda versions for notebook testing (#14722) @raydouglass
- Fix to_numeric not preserving Series index and name (#14718) @mroeschke
- Update dask-cudf wheel name (#14713) @raydouglass
- Fix strings::contains matching end of string target (#14711) @davidwendt
- Update to Dask's `shuffle_method` kwarg (#14708) @pentschev
- Write file-level statistics when writing ORC files with zero rows (#14707) @vuule
- Potential fix for performance regression in #14415 (#14706) @etseidl
- Ensure DataFrame column types are preserved during serialization (#14705) @mroeschke
- Skip numba test that fails on ARM (#14702) @brandon-b-miller
- Allow Z in datetime string parsing in non pandas compat mode (#14701) @mroeschke
- Fix nan_as_null not being respected when passing arrow object (#14688) @mroeschke
- Fix constructing Series/Index from arrow array and dtype (#14686) @mroeschke
- Fix Aggregation Type Promotion: Ensure Unsigned Input Types Result in Unsigned Output for Sum and Multiply (#14679) @SurajAralihalli
- Add BaseOffset as a final proxy type to pass instancechecks for offsets against `BaseOffset` (#14678) @shwina
- Add row conversion code from spark-rapids-jni (#14664) @ttnghia
- Unconditionally export the CCCL path (#14656) @vyasr
- Ensure libcudf searches for our patched version of CCCL first (#14655) @robertmaynard
- Constrain CUDA in notebook testing to prevent CUDA 12.1 usage until we have pynvjitlink (#14648) @vyasr
- Fix invalid memory access in Parquet reader (#14637) @etseidl
- Use column_empty over as_column([]) (#14632) @mroeschke
- Add (implicit) handling for torch tensors in is_scalar (#14623) @wence-
- Fix astype/fillna not maintaining column subclass and types (#14615) @mroeschke
- Remove non-empty nulls in cudf::get_json_object (#14609) @davidwendt
- Remove `cuda::proclaim_return_type` from nested lambda (#14607) @ttnghia
- Fix DataFrame.reindex when column reindexing to MultiIndex/RangeIndex (#14605) @mroeschke
- Address potential race conditions in Parquet reader (#14602) @etseidl
- Fix DataFrame.reindex removing column name (#14601) @mroeschke
- Remove unsanitized input test data from copy gtests (#14600) @davidwendt
- Fix race detected in Parquet writer (#14598) @etseidl
- Correct invalid or missing return types (#14587) @robertmaynard
- Fix unsanitized nulls from strings segmented-reduce (#14586) @davidwendt
- Upgrade to nvCOMP 3.0.5 (#14581) @davidwendt
- Fix unsanitized nulls produced by `cudf::clamp` APIs (#14580) @davidwendt
- Fix unsanitized nulls produced by libcudf dictionary decode (#14578) @davidwendt
- Fixes a symbol group lookup table issue (#14561) @elstehle
- Drop llvm16 from cuda118-conda devcontainer image (#14526) @charlesbluca
- REF: Make DataFrame.from_pandas process by column (#14483) @mroeschke
- Improve memory footprint of isin by using contains (#14478) @wence-
- Move creation of env.yaml outside the current directory (#14476) @davidwendt
- Enable `pd.Timestamp` objects to be picklable when `cudf.pandas` is active (#14474) @shwina
- Correct dtype of count aggregations on empty dataframes (#14473) @wence-
- Avoid DataFrame conversion in `MultiIndex.from_pandas` (#14470) @mroeschke
- JSON writer: avoid default stream use in `string_scalar` constructors (#14444) @vuule
- Fix default stream use in the CSV reader (#14443) @vuule
- Preserve DataFrame(columns=).columns dtype during empty-like construction (#14381) @mroeschke
- Defer PTX file load to runtime (#13690) @brandon-b-miller
📖 Documentation
- Disable parallel build (#14796) @vyasr
- Add pylibcudf to the docs (#14791) @vyasr
- Describe unpickling expectations when cudf.pandas is enabled (#14693) @shwina
- Update CONTRIBUTING for pyproject-only builds (#14653) @vyasr
- More doxygen fixes (#14639) @vyasr
- Enable doxygen XML generation and fix issues (#14477) @vyasr
- Some doxygen improvements (#14469) @vyasr
- Remove warning in dask-cudf docs (#14454) @wence-
- Update README links with redirects. (#14378) @bdice
- Add pip install instructions to README (#13677) @shwina
🚀 New Features
- Add ci check for external kernels (#14768) @robertmaynard
- JSON single quote normalization API (#14729) @shrshi
- Write cuDF version in Parquet "created_by" metadata field (#14721) @etseidl
- Implement remaining copying APIs in pylibcudf along with required helper functions (#14640) @vyasr
- Don't constrain `numba<0.58` (#14616) @brandon-b-miller
- Add DELTA_LENGTH_BYTE_ARRAY encoder and decoder for Parquet (#14590) @etseidl
- JSON - Parse mixed types as string in JSON reader (#14572) @karthikeyann
- JSON quote normalization (#14545) @shrshi
- Make DefaultHostMemoryAllocator settable (#14523) @gerashegalov
- Implement more copying APIs in pylibcudf (#14508) @vyasr
- Include writer code and writerVersion in ORC files (#14458) @vuule
- Parquet sub-rowgroup reading. (#14360) @nvdbaranec
- Move chars column to parent data buffer in strings column (#14202) @karthikeyann
- PARQUET-2261 Size Statistics (#14000) @etseidl
- Improve GroupBy JIT error handling (#13854) @brandon-b-miller
- Generate unified Python/C++ docs (#13846) @vyasr
- Expand JIT groupby test suite (#13813) @brandon-b-miller
🛠️ Improvements
- Pin `pytest<8` (#14920) @galipremsagar
- Move cudf::char_utf8 definition from detail to public header (#14779) @davidwendt
- Clean up `TimedeltaIndex.__init__` constructor (#14775) @mroeschke
- Clean up `DatetimeIndex.__init__` constructor (#14774) @mroeschke
- Some `frame.py` typing, move seldom used methods in `frame.py` (#14766) @mroeschke
- Remove **kwargs from astype (#14765) @mroeschke
- fix benchmarks compatibility with newer pytest-cases (#14764) @jameslamb
- Add `pynvjitlink` as a dependency (#14763) @brandon-b-miller
- Resolve degenerate performance in `create_structs_data` (#14761) @SurajAralihalli
- Simplify ColumnAccessor methods; avoid unnecessary validations (#14758) @mroeschke
- Pin pytest-cases<3.8.2 (#14756) @mroeschke
- Use fromdata instead of fromcolumns for initializing Frame (#14755) @mroeschke
- Consolidate cudf object handling in as_column (#14754) @mroeschke
- Reduce execution time of Parquet C++ tests (#14750) @vuule
- Implement to_datetime(..., utc=True) (#14749) @mroeschke
- Remove usages of rapids-env-update (#14748) @KyleFromNVIDIA
- Provide explicit pool size and avoid RMM detail APIs (#14741) @harrism
- Implement `cudf.MultiIndex.from_arrays` (#14740) @mroeschke
- Remove unused/single use methods (#14739) @mroeschke
- refactor CUDA versions in dependencies.yaml (#14733) @jameslamb
- Remove unneeded methods in Column (#14730) @mroeschke
- Clean up base column methods (#14725) @mroeschke
- Ensure column.fillna signatures are consistent (#14724) @mroeschke
- Remove mimesis as a testing dependency (#14723) @mroeschke
- Replace as_numerical with as_numerical_column/codes (#14719) @mroeschke
- Use offsetalator in gather_chars (#14700) @davidwendt
- Use make_strings_children for fill() specialization logic (#14697) @davidwendt
- Change `io::detail::orc` namespace into `io::orc::detail` (#14696) @ttnghia
- Fix call to deprecated factory function (#14695) @davidwendt
- Use as_column instead of arange for range like inputs (#14689) @mroeschke
- Reorganize ORC reader into multiple files and perform some small fixes to cuIO code (#14665) @ttnghia
- Split parquet test into multiple files (#14663) @etseidl
- Custom error messages for IO with nonexistent files (#14662) @vuule
- Explicitly pass .dtype into is_foo_dtype functions (#14657) @mroeschke
- Basic validation in reader benchmarks (#14647) @vuule
- Update dependencies.yaml to support CUDA 12.*. (#14644) @bdice
- Consolidate memoryview handling in as_column (#14643) @mroeschke
- Convert `FieldType` to scoped enum (#14642) @vuule
- Use instance over is_foo_dtype (#14641) @mroeschke
- Use isinstance over is_foo_dtype internally (#14638) @mroeschke
- Remove unnecessary **kwargs in function signatures (#14635) @mroeschke
- Drop nvbench patch for nvml. (#14631) @bdice
- Drop Pascal GPU support. (#14630) @bdice
- Add cpp/doxygen/xml to .gitignore (#14613) @davidwendt
- Create strings-specific make_offsets_child_column for multiple offset types (#14612) @davidwendt
- Use the offsetalator in cudf::concatenate for strings (#14611) @davidwendt
- Make Parquet ColumnIndex null_counts optional (#14596) @etseidl
- Support `freq` in DatetimeIndex (#14593) @shwina
- Remove legacy benchmarks for cuDF-python (#14591) @osidekyle
- Remove WORKSPACE env var from cudf_test temp_directory class (#14588) @davidwendt
- Use exceptions instead of return values to handle errors in `CompactProtocolReader` (#14582) @vuule
- Use cuda::proclaim_return_type on device lambdas. (#14577) @bdice
- Update to CCCL 2.2.0. (#14576) @bdice
- Update dependencies.yaml to new pip index (#14575) @vyasr
- Simplify Python CMake (#14565) @vyasr
- Java expose parquet pass_read_limit (#14564) @revans2
- Add column sanitization checks in `CUDF_TEST_EXPECT_COLUMN_*` macros (#14559) @SurajAralihalli
- Use cudf_test temp_directory class for nvtext::subword_tokenize gbenchmark (#14558) @davidwendt
- Fix return type of prefix increment overloads (#14544) @vuule
- Make bpe_merge_pairs_impl member private (#14543) @davidwendt
- Small clean up in `io::statistics` (#14542) @vuule
- Change json gtest environment variable to compile-time definition (#14541) @davidwendt
- Remove extra total chars size calculation from cudf::concatenate (#14540) @davidwendt
- Refactor IndexedFrame.hash_values to use cudf::hashing functions, add xxhash64 to cudf Python. (#14538) @bdice
- Move non-templated inline function definitions from table_view.hpp to table_view.cpp (#14535) @davidwendt
- Add JNI for strings::code_points (#14533) @thirtiseven
- Add a test for issue 12773 (#14529) @vyasr
- Split libarrow build dependencies. (#14506) @bdice
- Implement `IndexedFrame.duplicated` with `distinct_indices` + `scatter` (#14493) @wence-
- Expunge as_frame conversions in Column algorithms (#14491) @wence-
- Remove unsanitized null from input strings column in rank_tests.cpp (#14475) @davidwendt
- Refactor Parquet kernel_error (#14464) @etseidl
- Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
- Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
- Introduce Comprehensive Pathological Unit Tests for Issue #14409 (#14459) @aocsa
- Expose stream parameter in public nvtext APIs (#14456) @davidwendt
- Include encode type in the error message when unsupported Parquet encoding is detected (#14453) @ZelboK
- Remove null mask for zero nulls in json readers (#14451) @karthikeyann
- Refactor cudf.Series.__init__ (#14450) @mroeschke
- Remove the use of `volatile` in Parquet (#14448) @vuule
- REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
- Testing stream pool implementation (#14437) @shrshi
- Match pandas join ordering obligations in pandas-compatible mode (#14428) @wence-
- Forward-merge branch-23.12 to branch-24.02 (#14426) @bdice
- Use isinstance(..., cudf.IntervalDtype) instead of is_interval_dtype (#14424) @mroeschke
- Use isinstance(..., cudf.CategoricalDtype) instead of is_categorical_dtype (#14423) @mroeschke
- Forward-merge branch-23.12 to branch-24.02 (#14422) @bdice
- REF: Remove instances of pd.core (#14421) @mroeschke
- Expose streams in public filling APIs for label_bins (#14401) @ZelboK
- Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
- Limit DELTA_BINARY_PACKED encoder to the same number of bits as the physical type being encoded (#14392) @etseidl
- Add SHA-1 and SHA-2 hash functions. (#14391) @bdice
- Expose streams in Parquet reader and writer APIs (#14359) @shrshi
- Update to fmt 10.1.1 and spdlog 1.12.0. (#14355) @bdice
- Replace default stream for scalars and column factories usages (because of defaulted arguments) (#14354) @karthikeyann
- Expose streams in ORC reader and writer APIs (#14350) @shrshi
- Convert compression and io to string axis type in IO benchmarks (#14347) @SurajAralihalli
- Add cuDF devcontainers (#14015) @trxcllnt
- Refactoring of Buffers (last step towards unifying COW and Spilling) (#13801) @madsbk
- Switch to scikit-build-core (#13531) @vyasr
- Simplify null count checking in column equality comparator (#13312) @vyasr
Published by raydouglass almost 2 years ago
https://github.com/rapidsai/cudf - v24.02.01
🚨 Breaking Changes
- Remove **kwargs from astype (#14765) @mroeschke
- Remove mimesis as a testing dependency (#14723) @mroeschke
- Update to Dask's `shuffle_method` kwarg (#14708) @pentschev
- Drop Pascal GPU support. (#14630) @bdice
- Update to CCCL 2.2.0. (#14576) @bdice
- Expunge as_frame conversions in Column algorithms (#14491) @wence-
- Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
- Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
- Include writer code and writerVersion in ORC files (#14458) @vuule
- Remove null mask for zero nulls in json readers (#14451) @karthikeyann
- REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
- Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
- Move chars column to parent data buffer in strings column (#14202) @karthikeyann
- Switch to scikit-build-core (#13531) @vyasr
🐛 Bug Fixes
- [HOTFIX] Unpin numba<0.58 (#15031) @raydouglass
- Exclude tests from builds (#14981) @vyasr
- Fix the bounce buffer size in ORC writer (#14947) @vuule
- Revert sum/product aggregation to always produce `int64_t` type (#14907) @SurajAralihalli
- Fixed an issue with output chunking computation stemming from input chunking. (#14889) @nvdbaranec
- Fix total_byte_size in Parquet row group metadata (#14802) @etseidl
- Fix index difference to follow the pandas format (#14789) @amiralimi
- Fix shared-workflows repo name (#14784) @raydouglass
- Remove unparseable attributes from all nodes (#14780) @vyasr
- Refactor and add validation to IntervalIndex.__init__ (#14778) @mroeschke
- Work around incompatibilities between V2 page header handling and zStandard compression in Parquet writer (#14772) @etseidl
- Fix calls to deprecated strings factory API (#14771) @davidwendt
- Fix ptx file discovery in editable installs (#14767) @vyasr
- Revise `shuffle` deprecation to align with dask/dask (#14762) @rjzamora
- Enable intermediate proxies to be picklable (#14752) @shwina
- Add CUDF_TEST_PROGRAM_MAIN macro to tests lacking it (#14751) @etseidl
- Fix CMake args (#14746) @vyasr
- Fix logic bug introduced in #14730 (#14742) @wence-
- [Java] Choose The Correct RoundingMode For Checking Decimal OutOfBounds (#14731) @razajafri
- Fix `Groupby.get_group` (#14728) @rjzamora
- Ensure that all CUDA kernels in cudf have hidden visibility. (#14726) @robertmaynard
- Split cuda versions for notebook testing (#14722) @raydouglass
- Fix to_numeric not preserving Series index and name (#14718) @mroeschke
- Update dask-cudf wheel name (#14713) @raydouglass
- Fix strings::contains matching end of string target (#14711) @davidwendt
- Update to Dask's `shuffle_method` kwarg (#14708) @pentschev
- Write file-level statistics when writing ORC files with zero rows (#14707) @vuule
- Potential fix for performance regression in #14415 (#14706) @etseidl
- Ensure DataFrame column types are preserved during serialization (#14705) @mroeschke
- Skip numba test that fails on ARM (#14702) @brandon-b-miller
- Allow Z in datetime string parsing in non pandas compat mode (#14701) @mroeschke
- Fix nan_as_null not being respected when passing arrow object (#14688) @mroeschke
- Fix constructing Series/Index from arrow array and dtype (#14686) @mroeschke
- Fix Aggregation Type Promotion: Ensure Unsigned Input Types Result in Unsigned Output for Sum and Multiply (#14679) @SurajAralihalli
- Add BaseOffset as a final proxy type to pass instancechecks for offsets against `BaseOffset` (#14678) @shwina
- Add row conversion code from spark-rapids-jni (#14664) @ttnghia
- Unconditionally export the CCCL path (#14656) @vyasr
- Ensure libcudf searches for our patched version of CCCL first (#14655) @robertmaynard
- Constrain CUDA in notebook testing to prevent CUDA 12.1 usage until we have pynvjitlink (#14648) @vyasr
- Fix invalid memory access in Parquet reader (#14637) @etseidl
- Use column_empty over as_column([]) (#14632) @mroeschke
- Add (implicit) handling for torch tensors in is_scalar (#14623) @wence-
- Fix astype/fillna not maintaining column subclass and types (#14615) @mroeschke
- Remove non-empty nulls in cudf::get_json_object (#14609) @davidwendt
- Remove `cuda::proclaim_return_type` from nested lambda (#14607) @ttnghia
- Fix DataFrame.reindex when column reindexing to MultiIndex/RangeIndex (#14605) @mroeschke
- Address potential race conditions in Parquet reader (#14602) @etseidl
- Fix DataFrame.reindex removing column name (#14601) @mroeschke
- Remove unsanitized input test data from copy gtests (#14600) @davidwendt
- Fix race detected in Parquet writer (#14598) @etseidl
- Correct invalid or missing return types (#14587) @robertmaynard
- Fix unsanitized nulls from strings segmented-reduce (#14586) @davidwendt
- Upgrade to nvCOMP 3.0.5 (#14581) @davidwendt
- Fix unsanitized nulls produced by `cudf::clamp` APIs (#14580) @davidwendt
- Fix unsanitized nulls produced by libcudf dictionary decode (#14578) @davidwendt
- Fixes a symbol group lookup table issue (#14561) @elstehle
- Drop llvm16 from cuda118-conda devcontainer image (#14526) @charlesbluca
- REF: Make DataFrame.from_pandas process by column (#14483) @mroeschke
- Improve memory footprint of isin by using contains (#14478) @wence-
- Move creation of env.yaml outside the current directory (#14476) @davidwendt
- Enable `pd.Timestamp` objects to be picklable when `cudf.pandas` is active (#14474) @shwina
- Correct dtype of count aggregations on empty dataframes (#14473) @wence-
- Avoid DataFrame conversion in `MultiIndex.from_pandas` (#14470) @mroeschke
- JSON writer: avoid default stream use in `string_scalar` constructors (#14444) @vuule
- Fix default stream use in the CSV reader (#14443) @vuule
- Preserve DataFrame(columns=).columns dtype during empty-like construction (#14381) @mroeschke
- Defer PTX file load to runtime (#13690) @brandon-b-miller
📖 Documentation
- Disable parallel build (#14796) @vyasr
- Add pylibcudf to the docs (#14791) @vyasr
- Describe unpickling expectations when cudf.pandas is enabled (#14693) @shwina
- Update CONTRIBUTING for pyproject-only builds (#14653) @vyasr
- More doxygen fixes (#14639) @vyasr
- Enable doxygen XML generation and fix issues (#14477) @vyasr
- Some doxygen improvements (#14469) @vyasr
- Remove warning in dask-cudf docs (#14454) @wence-
- Update README links with redirects. (#14378) @bdice
- Add pip install instructions to README (#13677) @shwina
🚀 New Features
- Add ci check for external kernels (#14768) @robertmaynard
- JSON single quote normalization API (#14729) @shrshi
- Write cuDF version in Parquet "created_by" metadata field (#14721) @etseidl
- Implement remaining copying APIs in pylibcudf along with required helper functions (#14640) @vyasr
- Don't constrain `numba<0.58` (#14616) @brandon-b-miller
- Add DELTA_LENGTH_BYTE_ARRAY encoder and decoder for Parquet (#14590) @etseidl
- JSON - Parse mixed types as string in JSON reader (#14572) @karthikeyann
- JSON quote normalization (#14545) @shrshi
- Make DefaultHostMemoryAllocator settable (#14523) @gerashegalov
- Implement more copying APIs in pylibcudf (#14508) @vyasr
- Include writer code and writerVersion in ORC files (#14458) @vuule
- Parquet sub-rowgroup reading. (#14360) @nvdbaranec
- Move chars column to parent data buffer in strings column (#14202) @karthikeyann
- PARQUET-2261 Size Statistics (#14000) @etseidl
- Improve GroupBy JIT error handling (#13854) @brandon-b-miller
- Generate unified Python/C++ docs (#13846) @vyasr
- Expand JIT groupby test suite (#13813) @brandon-b-miller
🛠️ Improvements
- Pin `pytest<8` (#14920) @galipremsagar
- Move cudf::char_utf8 definition from detail to public header (#14779) @davidwendt
- Clean up `TimedeltaIndex.__init__` constructor (#14775) @mroeschke
- Clean up `DatetimeIndex.__init__` constructor (#14774) @mroeschke
- Some `frame.py` typing, move seldom used methods in `frame.py` (#14766) @mroeschke
- Remove **kwargs from astype (#14765) @mroeschke
- fix benchmarks compatibility with newer pytest-cases (#14764) @jameslamb
- Add `pynvjitlink` as a dependency (#14763) @brandon-b-miller
- Resolve degenerate performance in `create_structs_data` (#14761) @SurajAralihalli
- Simplify ColumnAccessor methods; avoid unnecessary validations (#14758) @mroeschke
- Pin pytest-cases<3.8.2 (#14756) @mroeschke
- Use _from_data instead of _from_columns for initializing Frame (#14755) @mroeschke
- Consolidate cudf object handling in as_column (#14754) @mroeschke
- Reduce execution time of Parquet C++ tests (#14750) @vuule
- Implement to_datetime(..., utc=True) (#14749) @mroeschke
- Remove usages of rapids-env-update (#14748) @KyleFromNVIDIA
- Provide explicit pool size and avoid RMM detail APIs (#14741) @harrism
- Implement `cudf.MultiIndex.from_arrays` (#14740) @mroeschke
- Remove unused/single use methods (#14739) @mroeschke
- refactor CUDA versions in dependencies.yaml (#14733) @jameslamb
- Remove unneeded methods in Column (#14730) @mroeschke
- Clean up base column methods (#14725) @mroeschke
- Ensure column.fillna signatures are consistent (#14724) @mroeschke
- Remove mimesis as a testing dependency (#14723) @mroeschke
- Replace as_numerical with as_numerical_column/codes (#14719) @mroeschke
- Use offsetalator in gather_chars (#14700) @davidwendt
- Use make_strings_children for fill() specialization logic (#14697) @davidwendt
- Change `io::detail::orc` namespace into `io::orc::detail` (#14696) @ttnghia
- Fix call to deprecated factory function (#14695) @davidwendt
- Use as_column instead of arange for range like inputs (#14689) @mroeschke
- Reorganize ORC reader into multiple files and perform some small fixes to cuIO code (#14665) @ttnghia
- Split parquet test into multiple files (#14663) @etseidl
- Custom error messages for IO with nonexistent files (#14662) @vuule
- Explicitly pass .dtype into is_foo_dtype functions (#14657) @mroeschke
- Basic validation in reader benchmarks (#14647) @vuule
- Update dependencies.yaml to support CUDA 12.*. (#14644) @bdice
- Consolidate memoryview handling in as_column (#14643) @mroeschke
- Convert `FieldType` to scoped enum (#14642) @vuule
- Use instance over is_foo_dtype (#14641) @mroeschke
- Use isinstance over is_foo_dtype internally (#14638) @mroeschke
- Remove unnecessary **kwargs in function signatures (#14635) @mroeschke
- Drop nvbench patch for nvml. (#14631) @bdice
- Drop Pascal GPU support. (#14630) @bdice
- Add cpp/doxygen/xml to .gitignore (#14613) @davidwendt
- Create strings-specific make_offsets_child_column for multiple offset types (#14612) @davidwendt
- Use the offsetalator in cudf::concatenate for strings (#14611) @davidwendt
- Make Parquet ColumnIndex null_counts optional (#14596) @etseidl
- Support `freq` in DatetimeIndex (#14593) @shwina
- Remove legacy benchmarks for cuDF-python (#14591) @osidekyle
- Remove WORKSPACE env var from cudf_test temp_directory class (#14588) @davidwendt
- Use exceptions instead of return values to handle errors in `CompactProtocolReader` (#14582) @vuule
- Use cuda::proclaim_return_type on device lambdas. (#14577) @bdice
- Update to CCCL 2.2.0. (#14576) @bdice
- Update dependencies.yaml to new pip index (#14575) @vyasr
- Simplify Python CMake (#14565) @vyasr
- Java expose parquet pass_read_limit (#14564) @revans2
- Add column sanitization checks in `CUDF_TEST_EXPECT_COLUMN_*` macros (#14559) @SurajAralihalli
- Use cudf_test temp_directory class for nvtext::subword_tokenize gbenchmark (#14558) @davidwendt
- Fix return type of prefix increment overloads (#14544) @vuule
- Make bpe_merge_pairs_impl member private (#14543) @davidwendt
- Small clean up in `io::statistics` (#14542) @vuule
- Change json gtest environment variable to compile-time definition (#14541) @davidwendt
- Remove extra total chars size calculation from cudf::concatenate (#14540) @davidwendt
- Refactor IndexedFrame.hash_values to use cudf::hashing functions, add xxhash64 to cudf Python. (#14538) @bdice
- Move non-templated inline function definitions from table_view.hpp to table_view.cpp (#14535) @davidwendt
- Add JNI for strings::code_points (#14533) @thirtiseven
- Add a test for issue 12773 (#14529) @vyasr
- Split libarrow build dependencies. (#14506) @bdice
- Implement `IndexedFrame.duplicated` with `distinct_indices` + `scatter` (#14493) @wence-
- Expunge as_frame conversions in Column algorithms (#14491) @wence-
- Remove unsanitized null from input strings column in rank_tests.cpp (#14475) @davidwendt
- Refactor Parquet kernel_error (#14464) @etseidl
- Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
- Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
- Introduce Comprehensive Pathological Unit Tests for Issue #14409 (#14459) @aocsa
- Expose stream parameter in public nvtext APIs (#14456) @davidwendt
- Include encode type in the error message when unsupported Parquet encoding is detected (#14453) @ZelboK
- Remove null mask for zero nulls in json readers (#14451) @karthikeyann
- Refactor cudf.Series.__init__ (#14450) @mroeschke
- Remove the use of `volatile` in Parquet (#14448) @vuule
- REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
- Testing stream pool implementation (#14437) @shrshi
- Match pandas join ordering obligations in pandas-compatible mode (#14428) @wence-
- Forward-merge branch-23.12 to branch-24.02 (#14426) @bdice
- Use isinstance(..., cudf.IntervalDtype) instead of is_interval_dtype (#14424) @mroeschke
- Use isinstance(..., cudf.CategoricalDtype) instead of is_categorical_dtype (#14423) @mroeschke
- Forward-merge branch-23.12 to branch-24.02 (#14422) @bdice
- REF: Remove instances of pd.core (#14421) @mroeschke
- Expose streams in public filling APIs for label_bins (#14401) @ZelboK
- Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
- Limit DELTA_BINARY_PACKED encoder to the same number of bits as the physical type being encoded (#14392) @etseidl
- Add SHA-1 and SHA-2 hash functions. (#14391) @bdice
- Expose streams in Parquet reader and writer APIs (#14359) @shrshi
- Update to fmt 10.1.1 and spdlog 1.12.0. (#14355) @bdice
- Replace default stream for scalars and column factories usages (because of defaulted arguments) (#14354) @karthikeyann
- Expose streams in ORC reader and writer APIs (#14350) @shrshi
- Convert compression and io to string axis type in IO benchmarks (#14347) @SurajAralihalli
- Add cuDF devcontainers (#14015) @trxcllnt
- Refactoring of Buffers (last step towards unifying COW and Spilling) (#13801) @madsbk
- Switch to scikit-build-core (#13531) @vyasr
- Simplify null count checking in column equality comparator (#13312) @vyasr
Published by raydouglass about 2 years ago
https://github.com/rapidsai/cudf - v24.02.00
🚨 Breaking Changes
- Remove **kwargs from astype (#14765) @mroeschke
- Remove mimesis as a testing dependency (#14723) @mroeschke
- Update to Dask's `shuffle_method` kwarg (#14708) @pentschev
- Drop Pascal GPU support. (#14630) @bdice
- Update to CCCL 2.2.0. (#14576) @bdice
- Expunge as_frame conversions in Column algorithms (#14491) @wence-
- Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
- Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
- Include writer code and writerVersion in ORC files (#14458) @vuule
- Remove null mask for zero nulls in json readers (#14451) @karthikeyann
- REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
- Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
- Move chars column to parent data buffer in strings column (#14202) @karthikeyann
- Switch to scikit-build-core (#13531) @vyasr
🐛 Bug Fixes
- Exclude tests from builds (#14981) @vyasr
- Fix the bounce buffer size in ORC writer (#14947) @vuule
- Revert sum/product aggregation to always produce `int64_t` type (#14907) @SurajAralihalli
- Fixed an issue with output chunking computation stemming from input chunking. (#14889) @nvdbaranec
- Fix total_byte_size in Parquet row group metadata (#14802) @etseidl
- Fix index difference to follow the pandas format (#14789) @amiralimi
- Fix shared-workflows repo name (#14784) @raydouglass
- Remove unparseable attributes from all nodes (#14780) @vyasr
- Refactor and add validation to IntervalIndex.__init__ (#14778) @mroeschke
- Work around incompatibilities between V2 page header handling and zStandard compression in Parquet writer (#14772) @etseidl
- Fix calls to deprecated strings factory API (#14771) @davidwendt
- Fix ptx file discovery in editable installs (#14767) @vyasr
- Revise `shuffle` deprecation to align with dask/dask (#14762) @rjzamora
- Enable intermediate proxies to be picklable (#14752) @shwina
- Add CUDF_TEST_PROGRAM_MAIN macro to tests lacking it (#14751) @etseidl
- Fix CMake args (#14746) @vyasr
- Fix logic bug introduced in #14730 (#14742) @wence-
- [Java] Choose The Correct RoundingMode For Checking Decimal OutOfBounds (#14731) @razajafri
- Fix `Groupby.get_group` (#14728) @rjzamora
- Ensure that all CUDA kernels in cudf have hidden visibility. (#14726) @robertmaynard
- Split cuda versions for notebook testing (#14722) @raydouglass
- Fix to_numeric not preserving Series index and name (#14718) @mroeschke
- Update dask-cudf wheel name (#14713) @raydouglass
- Fix strings::contains matching end of string target (#14711) @davidwendt
- Update to Dask's `shuffle_method` kwarg (#14708) @pentschev
- Write file-level statistics when writing ORC files with zero rows (#14707) @vuule
- Potential fix for performance regression in #14415 (#14706) @etseidl
- Ensure DataFrame column types are preserved during serialization (#14705) @mroeschke
- Skip numba test that fails on ARM (#14702) @brandon-b-miller
- Allow Z in datetime string parsing in non pandas compat mode (#14701) @mroeschke
- Fix nan_as_null not being respected when passing arrow object (#14688) @mroeschke
- Fix constructing Series/Index from arrow array and dtype (#14686) @mroeschke
- Fix Aggregation Type Promotion: Ensure Unsigned Input Types Result in Unsigned Output for Sum and Multiply (#14679) @SurajAralihalli
- Add BaseOffset as a final proxy type to pass instancechecks for offsets against `BaseOffset` (#14678) @shwina
- Add row conversion code from spark-rapids-jni (#14664) @ttnghia
- Unconditionally export the CCCL path (#14656) @vyasr
- Ensure libcudf searches for our patched version of CCCL first (#14655) @robertmaynard
- Constrain CUDA in notebook testing to prevent CUDA 12.1 usage until we have pynvjitlink (#14648) @vyasr
- Fix invalid memory access in Parquet reader (#14637) @etseidl
- Use column_empty over as_column([]) (#14632) @mroeschke
- Add (implicit) handling for torch tensors in is_scalar (#14623) @wence-
- Fix astype/fillna not maintaining column subclass and types (#14615) @mroeschke
- Remove non-empty nulls in cudf::get_json_object (#14609) @davidwendt
- Remove `cuda::proclaim_return_type` from nested lambda (#14607) @ttnghia
- Fix DataFrame.reindex when column reindexing to MultiIndex/RangeIndex (#14605) @mroeschke
- Address potential race conditions in Parquet reader (#14602) @etseidl
- Fix DataFrame.reindex removing column name (#14601) @mroeschke
- Remove unsanitized input test data from copy gtests (#14600) @davidwendt
- Fix race detected in Parquet writer (#14598) @etseidl
- Correct invalid or missing return types (#14587) @robertmaynard
- Fix unsanitized nulls from strings segmented-reduce (#14586) @davidwendt
- Upgrade to nvCOMP 3.0.5 (#14581) @davidwendt
- Fix unsanitized nulls produced by `cudf::clamp` APIs (#14580) @davidwendt
- Fix unsanitized nulls produced by libcudf dictionary decode (#14578) @davidwendt
- Fixes a symbol group lookup table issue (#14561) @elstehle
- Drop llvm16 from cuda118-conda devcontainer image (#14526) @charlesbluca
- REF: Make DataFrame.from_pandas process by column (#14483) @mroeschke
- Improve memory footprint of isin by using contains (#14478) @wence-
- Move creation of env.yaml outside the current directory (#14476) @davidwendt
- Enable `pd.Timestamp` objects to be picklable when `cudf.pandas` is active (#14474) @shwina
- Correct dtype of count aggregations on empty dataframes (#14473) @wence-
- Avoid DataFrame conversion in `MultiIndex.from_pandas` (#14470) @mroeschke
- JSON writer: avoid default stream use in `string_scalar` constructors (#14444) @vuule
- Fix default stream use in the CSV reader (#14443) @vuule
- Preserve DataFrame(columns=).columns dtype during empty-like construction (#14381) @mroeschke
- Defer PTX file load to runtime (#13690) @brandon-b-miller
📖 Documentation
- Disable parallel build (#14796) @vyasr
- Add pylibcudf to the docs (#14791) @vyasr
- Describe unpickling expectations when cudf.pandas is enabled (#14693) @shwina
- Update CONTRIBUTING for pyproject-only builds (#14653) @vyasr
- More doxygen fixes (#14639) @vyasr
- Enable doxygen XML generation and fix issues (#14477) @vyasr
- Some doxygen improvements (#14469) @vyasr
- Remove warning in dask-cudf docs (#14454) @wence-
- Update README links with redirects. (#14378) @bdice
- Add pip install instructions to README (#13677) @shwina
🚀 New Features
- Add ci check for external kernels (#14768) @robertmaynard
- JSON single quote normalization API (#14729) @shrshi
- Write cuDF version in Parquet "created_by" metadata field (#14721) @etseidl
- Implement remaining copying APIs in pylibcudf along with required helper functions (#14640) @vyasr
- Don't constrain `numba<0.58` (#14616) @brandon-b-miller
- Add DELTA_LENGTH_BYTE_ARRAY encoder and decoder for Parquet (#14590) @etseidl
- JSON - Parse mixed types as string in JSON reader (#14572) @karthikeyann
- JSON quote normalization (#14545) @shrshi
- Make DefaultHostMemoryAllocator settable (#14523) @gerashegalov
- Implement more copying APIs in pylibcudf (#14508) @vyasr
- Include writer code and writerVersion in ORC files (#14458) @vuule
- Parquet sub-rowgroup reading. (#14360) @nvdbaranec
- Move chars column to parent data buffer in strings column (#14202) @karthikeyann
- PARQUET-2261 Size Statistics (#14000) @etseidl
- Improve GroupBy JIT error handling (#13854) @brandon-b-miller
- Generate unified Python/C++ docs (#13846) @vyasr
- Expand JIT groupby test suite (#13813) @brandon-b-miller
🛠️ Improvements
- Pin `pytest<8` (#14920) @galipremsagar
- Move cudf::char_utf8 definition from detail to public header (#14779) @davidwendt
- Clean up `TimedeltaIndex.__init__` constructor (#14775) @mroeschke
- Clean up `DatetimeIndex.__init__` constructor (#14774) @mroeschke
- Some `frame.py` typing, move seldom used methods in `frame.py` (#14766) @mroeschke
- Remove **kwargs from astype (#14765) @mroeschke
- fix benchmarks compatibility with newer pytest-cases (#14764) @jameslamb
- Add `pynvjitlink` as a dependency (#14763) @brandon-b-miller
- Resolve degenerate performance in `create_structs_data` (#14761) @SurajAralihalli
- Simplify ColumnAccessor methods; avoid unnecessary validations (#14758) @mroeschke
- Pin pytest-cases<3.8.2 (#14756) @mroeschke
- Use _from_data instead of _from_columns for initializing Frame (#14755) @mroeschke
- Consolidate cudf object handling in as_column (#14754) @mroeschke
- Reduce execution time of Parquet C++ tests (#14750) @vuule
- Implement to_datetime(..., utc=True) (#14749) @mroeschke
- Remove usages of rapids-env-update (#14748) @KyleFromNVIDIA
- Provide explicit pool size and avoid RMM detail APIs (#14741) @harrism
- Implement `cudf.MultiIndex.from_arrays` (#14740) @mroeschke
- Remove unused/single use methods (#14739) @mroeschke
- refactor CUDA versions in dependencies.yaml (#14733) @jameslamb
- Remove unneeded methods in Column (#14730) @mroeschke
- Clean up base column methods (#14725) @mroeschke
- Ensure column.fillna signatures are consistent (#14724) @mroeschke
- Remove mimesis as a testing dependency (#14723) @mroeschke
- Replace as_numerical with as_numerical_column/codes (#14719) @mroeschke
- Use offsetalator in gather_chars (#14700) @davidwendt
- Use make_strings_children for fill() specialization logic (#14697) @davidwendt
- Change `io::detail::orc` namespace into `io::orc::detail` (#14696) @ttnghia
- Fix call to deprecated factory function (#14695) @davidwendt
- Use as_column instead of arange for range like inputs (#14689) @mroeschke
- Reorganize ORC reader into multiple files and perform some small fixes to cuIO code (#14665) @ttnghia
- Split parquet test into multiple files (#14663) @etseidl
- Custom error messages for IO with nonexistent files (#14662) @vuule
- Explicitly pass .dtype into is_foo_dtype functions (#14657) @mroeschke
- Basic validation in reader benchmarks (#14647) @vuule
- Update dependencies.yaml to support CUDA 12.*. (#14644) @bdice
- Consolidate memoryview handling in as_column (#14643) @mroeschke
- Convert `FieldType` to scoped enum (#14642) @vuule
- Use instance over is_foo_dtype (#14641) @mroeschke
- Use isinstance over is_foo_dtype internally (#14638) @mroeschke
- Remove unnecessary **kwargs in function signatures (#14635) @mroeschke
- Drop nvbench patch for nvml. (#14631) @bdice
- Drop Pascal GPU support. (#14630) @bdice
- Add cpp/doxygen/xml to .gitignore (#14613) @davidwendt
- Create strings-specific make_offsets_child_column for multiple offset types (#14612) @davidwendt
- Use the offsetalator in cudf::concatenate for strings (#14611) @davidwendt
- Make Parquet ColumnIndex null_counts optional (#14596) @etseidl
- Support `freq` in DatetimeIndex (#14593) @shwina
- Remove legacy benchmarks for cuDF-python (#14591) @osidekyle
- Remove WORKSPACE env var from cudf_test temp_directory class (#14588) @davidwendt
- Use exceptions instead of return values to handle errors in `CompactProtocolReader` (#14582) @vuule
- Use cuda::proclaim_return_type on device lambdas. (#14577) @bdice
- Update to CCCL 2.2.0. (#14576) @bdice
- Update dependencies.yaml to new pip index (#14575) @vyasr
- Simplify Python CMake (#14565) @vyasr
- Java expose parquet pass_read_limit (#14564) @revans2
- Add column sanitization checks in `CUDF_TEST_EXPECT_COLUMN_*` macros (#14559) @SurajAralihalli
- Use cudf_test temp_directory class for nvtext::subword_tokenize gbenchmark (#14558) @davidwendt
- Fix return type of prefix increment overloads (#14544) @vuule
- Make bpe_merge_pairs_impl member private (#14543) @davidwendt
- Small clean up in `io::statistics` (#14542) @vuule
- Change json gtest environment variable to compile-time definition (#14541) @davidwendt
- Remove extra total chars size calculation from cudf::concatenate (#14540) @davidwendt
- Refactor IndexedFrame.hash_values to use cudf::hashing functions, add xxhash64 to cudf Python. (#14538) @bdice
- Move non-templated inline function definitions from table_view.hpp to table_view.cpp (#14535) @davidwendt
- Add JNI for strings::code_points (#14533) @thirtiseven
- Add a test for issue 12773 (#14529) @vyasr
- Split libarrow build dependencies. (#14506) @bdice
- Implement IndexedFrame.duplicated with distinct_indices + scatter (#14493) @wence-
- Expunge as_frame conversions in Column algorithms (#14491) @wence-
- Remove unsanitized null from input strings column in rank_tests.cpp (#14475) @davidwendt
- Refactor Parquet kernel_error (#14464) @etseidl
- Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
- Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
- Introduce Comprehensive Pathological Unit Tests for Issue #14409 (#14459) @aocsa
- Expose stream parameter in public nvtext APIs (#14456) @davidwendt
- Include encode type in the error message when unsupported Parquet encoding is detected (#14453) @ZelboK
- Remove null mask for zero nulls in json readers (#14451) @karthikeyann
- Refactor cudf.Series.__init__ (#14450) @mroeschke
- Remove the use of volatile in Parquet (#14448) @vuule
- REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
- Testing stream pool implementation (#14437) @shrshi
- Match pandas join ordering obligations in pandas-compatible mode (#14428) @wence-
- Forward-merge branch-23.12 to branch-24.02 (#14426) @bdice
- Use isinstance(..., cudf.IntervalDtype) instead of is_interval_dtype (#14424) @mroeschke
- Use isinstance(..., cudf.CategoricalDtype) instead of is_categorical_dtype (#14423) @mroeschke
- Forward-merge branch-23.12 to branch-24.02 (#14422) @bdice
- REF: Remove instances of pd.core (#14421) @mroeschke
- Expose streams in public filling APIs for label_bins (#14401) @ZelboK
- Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
- Limit DELTA_BINARY_PACKED encoder to the same number of bits as the physical type being encoded (#14392) @etseidl
- Add SHA-1 and SHA-2 hash functions. (#14391) @bdice
- Expose streams in Parquet reader and writer APIs (#14359) @shrshi
- Update to fmt 10.1.1 and spdlog 1.12.0. (#14355) @bdice
- Replace default stream for scalars and column factories usages (because of defaulted arguments) (#14354) @karthikeyann
- Expose streams in ORC reader and writer APIs (#14350) @shrshi
- Convert compression and io to string axis type in IO benchmarks (#14347) @SurajAralihalli
- Add cuDF devcontainers (#14015) @trxcllnt
- Refactoring of Buffers (last step towards unifying COW and Spilling) (#13801) @madsbk
- Switch to scikit-build-core (#13531) @vyasr
- Simplify null count checking in column equality comparator (#13312) @vyasr
- C++
Published by raydouglass about 2 years ago
https://github.com/rapidsai/cudf - [NIGHTLY] v24.04.00
🔗 Links
🚨 Breaking Changes
- Add future_stack to DataFrame.stack (#15015) @galipremsagar
- Deprecate groupby fillna (#15000) @mroeschke
- Deprecate replace with categorical columns (#14988) @mroeschke
- Deprecate parameters similar to pandas 2.2 (#14984) @mroeschke
- Add pandas-2.x support in cudf (#14916) @galipremsagar
🐛 Bug Fixes
- Fix Index.difference to handle duplicate values when one of the inputs is empty (#15016) @galipremsagar
- Add future_stack to DataFrame.stack (#15015) @galipremsagar
- Fix handling of values=None in pylibcudf GroupBy.get_groups (#14998) @shwina
- Fix DataFrame.sort_index to respect ignore_index on all axis (#14995) @galipremsagar
- Raise for pyarrow array that is tz-aware (#14980) @mroeschke
- Direct SeriesGroupBy.aggregate to SeriesGroupBy.agg (#14971) @rjzamora
- unset CUDF_SPILL after a pytest (#14958) @galipremsagar
- Fix dask token normalization (#14829) @rjzamora
- Fix 24.04 versions (#14825) @raydouglass
📖 Documentation
- [DOC] Update typo in docs example of structs_column_wrapper (#14949) @karthikeyann
- Update cudf.pandas FAQ. (#14940) @bdice
- Optimize doc builds (#14856) @vyasr
- Add developer guideline to use east const. (#14836) @bdice
- Notes convert to Pandas-compat (#12641) @Touutae-lab
🚀 New Features
- Implement replace in pylibcudf (#15005) @vyasr
- Implement rolling in pylibcudf (#14982) @vyasr
- Implement joins in pylibcudf (#14972) @vyasr
- Implement scans and reductions in pylibcudf (#14970) @vyasr
- Rewrite cudf internals using pylibcudf groupby (#14946) @vyasr
- Implement groupby in pylibcudf (#14945) @vyasr
- POC for whitespace removal in input JSON data using FST (#14931) @shrshi
- Remove supports_streams from cuDF custom memory resources. (#14857) @harrism
- Migrate unary operations to pylibcudf (#14850) @vyasr
- Migrate binary operations to pylibcudf (#14821) @vyasr
- Add row index and stripe size options to Python ORC chunked writer (#14785) @vuule
- Support CUDA 12.2 (#14712) @jameslamb
🛠️ Improvements
- Change copy_if_safe to call thrust instead of the overload function (#15018) @davidwendt
- Fix is_string_dtype test for pandas 2.2 (#15012) @mroeschke
- Clean up detail sequence header inclusion (#15007) @PointKernel
- Use offsetalator in cudf::strings::wrap() (#15002) @davidwendt
- Use offsetalator in cudf::strings::reverse (#15001) @davidwendt
- Deprecate groupby fillna (#15000) @mroeschke
- Ensure to_* IO methods respect pandas 2.2 keyword only deprecation (#14999) @mroeschke
- Filter all DeprecationWarning's by ArrowTable.to_pandas() (#14989) @galipremsagar
- Deprecate replace with categorical columns (#14988) @mroeschke
- Deprecate parameters similar to pandas 2.2 (#14984) @mroeschke
- Ensure that ctest is called with --no-tests=error. (#14983) @bdice
- Deprecate non-integer periods in date_range and interval_range (#14976) @galipremsagar
- Use fused types for overloaded function signatures (#14969) @vyasr
- Deprecate certain frequency strings (#14967) @galipremsagar
- Update copyrights for 24.04. (#14964) @bdice
- Introduce GetJsonObjectOptions in getJSONObject Java API (#14956) @SurajAralihalli
- JNI JSON read with DataSource and inferred schema, along with basic java nested Schema JSON reads (#14954) @revans2
- Make codecov only informational (always pass). (#14952) @bdice
- Replace legacy cudf and dask_cudf imports as (d)gd (#14944) @mroeschke
- Replace isdatetime64tz/interval_dtype with isinstance (#14943) @mroeschke
- Update tests for pandas 2. (#14941) @bdice
- Use more public pandas APIs (#14929) @mroeschke
- Add pandas-2.x support in cudf (#14916) @galipremsagar
- Use offsetalator in nvtext::byte_pair_encoding (#14888) @davidwendt
- De-DOS line-endings (#14880) @wence-
- Add detail cuco_allocator (#14877) @PointKernel
- Move all core types to using enum class in Cython (#14876) @vyasr
- Use int64 offset types for accessing code-points in nvtext::normalize (#14868) @davidwendt
- Update conda-cpp-post-build-checks to branch-24.04. (#14854) @bdice
- Remove deprecated strings functions (#14848) @davidwendt
- Fix CI workflows for pandas-tests and add test summary. (#14847) @bdice
- Use offsetalator in cudf::strings::copy_slice (#14844) @davidwendt
- Fix V2 Parquet page alignment for use with zStandard compression (#14841) @etseidl
- Fix calls to deprecated strings factory API in examples. (#14838) @bdice
- Update pre-commit hooks (#14837) @bdice
- Use rapids_cuda_set_runtime to determine cuda runtime usage by target (#14833) @vyasr
- Remove get_mem_info functions from custom memory resources (#14832) @harrism
- Fix debug build by splitting row_operator_tests_utilities.cu (#14826) @davidwendt
- Remove -DNVBench_ENABLE_CUPTI=OFF. (#14820) @bdice
- Branch 24.04 merge branch 24.02 (#14809) @vyasr
- Branch 24.04 merge branch 24.02 (#14806) @vyasr
- Use offsetalator in nvtext tokenize functions (#14783) @davidwendt
- Reduce execution time of Python ORC tests (#14776) @vuule
- Use offsetalator in cudf::strings::split functions (#14757) @davidwendt
- Use offsetalator in cudf::strings::findall (#14745) @davidwendt
- Use offsetalator in cudf::strings::url_decode (#14744) @davidwendt
- Use get_offset_value utility in strings shift function (#14743) @davidwendt
- C++
Published by rapids-bot[bot] about 2 years ago
https://github.com/rapidsai/cudf - v23.12.01
🚨 Breaking Changes
- Raise error in reindex when index is not unique (#14400) @galipremsagar
- Expose stream parameter to get_json_object API (#14297) @davidwendt
- Refactor cudf_kafka to use skbuild (#14292) @jdye64
- Expose stream parameter in public strings convert APIs (#14255) @davidwendt
- Upgrade to nvCOMP 3.0.4 (#13815) @vuule
🐛 Bug Fixes
- Fix synchronization issue when writing string columns with dictionary to ORC (#14595) @vuule
- Update actions/labeler to v4 (#14562) @raydouglass
- Fix data corruption when skipping rows (#14557) @etseidl
- Fix function name typo in cudf.pandas profiler (#14514) @galipremsagar
- Fix intermediate type checking in expression parsing (#14445) @vyasr
- Forward merge branch-23.10 into branch-23.12 (#14435) @raydouglass
- Remove needs: wheel-build-cudf. (#14427) @bdice
- Fix dask dependency in custreamz (#14420) @vyasr
- Ensure nvbench initializes nvml context when built statically (#14411) @robertmaynard
- Support java AST String literal with desired encoding (#14402) @winningsix
- Raise error in reindex when index is not unique (#14400) @galipremsagar
- Always build nvbench statically so we don't need to package it (#14399) @robertmaynard
- Fix token-count logic in nvtext::tokenize_with_vocabulary (#14393) @davidwendt
- Fix as_column(pd.Timestamp/Timedelta, length=) not respecting length (#14390) @mroeschke
- cudf.pandas: cuDF subpath checking in module __getattr__ (#14388) @shwina
- Fix and disable encoding for nanosecond statistics in ORC writer (#14367) @vuule
- Add the new manylinux builds to the build job (#14351) @vyasr
- cudf jit parser now supports .pragma instructions with quotes (#14348) @robertmaynard
- Fix overflow check in cudf::merge (#14345) @divyegala
- Add cramjam (#14344) @vyasr
- Enable dask_cudf/io pytests in CI (#14338) @galipremsagar
- Temporarily avoid the current build of pydata-sphinx-theme (#14332) @vyasr
- Fix host buffer access from device function in the Parquet reader (#14328) @vuule
- Run IO tests for Dask-cuDF (#14327) @rjzamora
- Fix logical type issues in the Parquet writer (#14322) @vuule
- Remove aws-sdk-pinning and revert to arrow 12.0.1 (#14319) @vyasr
- test is_valid before reading column data (#14318) @etseidl
- Fix gtest validity setting for TextTokenizeTest.Vocabulary (#14312) @davidwendt
- Fixes stack context for json lines format that recovers from invalid JSON lines (#14309) @elstehle
- Downgrade to Arrow 12.0.0 for aws-sdk-cpp and fix cudf_kafka builds for new CI containers (#14296) @vyasr
- fixing thread index overflow issue (#14290) @hyperbolic2346
- Fix memset error in nvtext::edit_distance_matrix (#14283) @davidwendt
- Changes JSON reader's recovery option's behaviour to ignore all characters after a valid JSON record (#14279) @elstehle
- Handle empty string correctly in Parquet statistics (#14257) @etseidl
- Fixes behaviour for incomplete lines when recover_with_nulls is enabled (#14252) @elstehle
- cudf::detail::pinned_allocator doesn't throw from deallocate (#14251) @robertmaynard
- Fix strings replace for adjacent, identical multi-byte UTF-8 character targets (#14235) @davidwendt
- Fix the precision when converting a decimal128 column to an arrow array (#14230) @jihoonson
- Fixing parquet list of struct interpretation (#13715) @hyperbolic2346
📖 Documentation
- Fix io reference in docs. (#14452) @bdice
- Update README (#14374) @shwina
- Example code for blog on new row comparators (#13795) @divyegala
🚀 New Features
- Expose streams in public unary APIs (#14342) @vyasr
- Add python tests for Parquet DELTA_BINARY_PACKED encoder (#14316) @etseidl
- Update rapids-cmake functions to non-deprecated signatures (#14265) @robertmaynard
- Expose streams in public null mask APIs (#14263) @vyasr
- Expose streams in binaryop APIs (#14187) @vyasr
- Add pylibcudf.Scalar that interoperates with Arrow scalars (#14133) @vyasr
- Add decoder for DELTA_BYTE_ARRAY to Parquet reader (#14101) @etseidl
- Add DELTA_BINARY_PACKED encoder for Parquet writer (#14100) @etseidl
- Add BytePairEncoder class to cuDF (#13891) @davidwendt
- Upgrade to nvCOMP 3.0.4 (#13815) @vuule
- Use pynvjitlink for CUDA 12+ MVC (#13650) @brandon-b-miller
🛠️ Improvements
- Build concurrency for nightly and merge triggers (#14441) @bdice
- Cleanup remaining usages of dask dependencies (#14407) @galipremsagar
- Update to Arrow 14.0.1. (#14387) @bdice
- Remove Cython libcpp wrappers (#14382) @vyasr
- Forward-merge branch-23.10 to branch-23.12 (#14372) @bdice
- Upgrade to arrow 14 (#14371) @galipremsagar
- Fix a pytest typo in test_kurt_skew_error (#14368) @galipremsagar
- Use new rapids-dask-dependency metapackage for managing dask versions (#14364) @vyasr
- Change nullable() to has_nulls() in cudf::detail::gather (#14363) @divyegala
- Split up scan_inclusive.cu to improve its compile time (#14358) @davidwendt
- Implement user_datasource_wrapper is_empty() and is_device_read_preferred(). (#14357) @tpn
- Added streams to CSV reader and writer api (#14340) @shrshi
- Upgrade wheels to use arrow 13 (#14339) @vyasr
- Rework nvtext::byte_pair_encoding API (#14337) @davidwendt
- Improve performance of nvtext::tokenize_with_vocabulary for long strings (#14336) @davidwendt
- Upgrade arrow to 13 (#14330) @galipremsagar
- Expose stream parameter in public nvtext replace APIs (#14329) @davidwendt
- Drop pyorc dependency and use pandas/pyarrow instead (#14323) @galipremsagar
- Avoid pyarrow.fs import for local storage (#14321) @rjzamora
- Unpin dask and distributed for 23.12 development (#14320) @galipremsagar
- Expose stream parameter in public nvtext tokenize APIs (#14317) @davidwendt
- Added streams to JSON reader and writer api (#14313) @shrshi
- Minor improvements in source_info (#14308) @vuule
- Forward-merge branch-23.10 to branch-23.12 (#14307) @bdice
- Add stream parameter to Set Operations (Public List APIs) (#14305) @SurajAralihalli
- Expose stream parameter to get_json_object API (#14297) @davidwendt
- Sort dictionary data alphabetically in the ORC writer (#14295) @vuule
- Expose stream parameter in public strings filter APIs (#14293) @davidwendt
- Refactor cudf_kafka to use skbuild (#14292) @jdye64
- Update shared-action-workflows references (#14289) @AyodeAwe
- Register partd encode dispatch in dask_cudf (#14287) @rjzamora
- Update versioning strategy (#14285) @vyasr
- Move and rename byte-pair-encoding source files (#14284) @davidwendt
- Expose stream parameter in public strings combine APIs (#14281) @davidwendt
- Expose stream parameter in public strings contains APIs (#14280) @davidwendt
- Add stream parameter to List Sort and Filter APIs (#14272) @SurajAralihalli
- Use branch-23.12 workflows. (#14271) @bdice
- Refactor LogicalType for Parquet (#14264) @etseidl
- Centralize chunked reading code in the parquet reader to reader_impl_chunking.cu (#14262) @nvdbaranec
- Expose stream parameter in public strings replace APIs (#14261) @davidwendt
- Expose stream parameter in public strings APIs (#14260) @davidwendt
- Cleanup of namespaces in parquet code. (#14259) @nvdbaranec
- Make parquet schema index type consistent (#14256) @hyperbolic2346
- Expose stream parameter in public strings convert APIs (#14255) @davidwendt
- Add in java bindings for DataSource (#14254) @revans2
- Reimplement cudf::merge for nested types without using comparators (#14250) @divyegala
- Add stream parameter to List Manipulation and Operations APIs (#14248) @SurajAralihalli
- Expose stream parameter in public strings split/partition APIs (#14247) @davidwendt
- Improve contains_column by invoking contains_table (#14238) @PointKernel
- Detect and report errors in Parquet header parsing (#14237) @etseidl
- Normalizing offsets iterator (#14234) @davidwendt
- Forward merge 23.10 into 23.12 (#14231) @galipremsagar
- Return error if BOOL8 column-type is used with integers-to-hex (#14208) @davidwendt
- Enable indexalator for device code (#14206) @davidwendt
- Marginally reduce memory footprint of joins (#14197) @wence-
- Add nvtx annotations to spilling-based data movement (#14196) @wence-
- Optimize ORC writer for decimal columns (#14190) @vuule
- Remove the use of volatile in ORC (#14175) @vuule
- Add bytes_per_second to distinct_count of stream_compaction nvbench. (#14172) @Blonck
- Add bytes_per_second to transpose benchmark (#14170) @Blonck
- cuDF: Build CUDA 12.0 ARM conda packages. (#14112) @bdice
- Add bytes_per_second to shift benchmark (#13950) @Blonck
- Extract debug_utilities.hpp/cu from column_utilities.hpp/cu (#13720) @ttnghia
- C++
Published by raydouglass about 2 years ago
https://github.com/rapidsai/cudf - v23.12.00
🚨 Breaking Changes
- Raise error in reindex when index is not unique (#14400) @galipremsagar
- Expose stream parameter to get_json_object API (#14297) @davidwendt
- Refactor cudf_kafka to use skbuild (#14292) @jdye64
- Expose stream parameter in public strings convert APIs (#14255) @davidwendt
- Upgrade to nvCOMP 3.0.4 (#13815) @vuule
🐛 Bug Fixes
- Update actions/labeler to v4 (#14562) @raydouglass
- Fix data corruption when skipping rows (#14557) @etseidl
- Fix function name typo in cudf.pandas profiler (#14514) @galipremsagar
- Fix intermediate type checking in expression parsing (#14445) @vyasr
- Forward merge branch-23.10 into branch-23.12 (#14435) @raydouglass
- Remove needs: wheel-build-cudf. (#14427) @bdice
- Fix dask dependency in custreamz (#14420) @vyasr
- Ensure nvbench initializes nvml context when built statically (#14411) @robertmaynard
- Support java AST String literal with desired encoding (#14402) @winningsix
- Raise error in reindex when index is not unique (#14400) @galipremsagar
- Always build nvbench statically so we don't need to package it (#14399) @robertmaynard
- Fix token-count logic in nvtext::tokenize_with_vocabulary (#14393) @davidwendt
- Fix as_column(pd.Timestamp/Timedelta, length=) not respecting length (#14390) @mroeschke
- cudf.pandas: cuDF subpath checking in module __getattr__ (#14388) @shwina
- Fix and disable encoding for nanosecond statistics in ORC writer (#14367) @vuule
- Add the new manylinux builds to the build job (#14351) @vyasr
- cudf jit parser now supports .pragma instructions with quotes (#14348) @robertmaynard
- Fix overflow check in cudf::merge (#14345) @divyegala
- Add cramjam (#14344) @vyasr
- Enable dask_cudf/io pytests in CI (#14338) @galipremsagar
- Temporarily avoid the current build of pydata-sphinx-theme (#14332) @vyasr
- Fix host buffer access from device function in the Parquet reader (#14328) @vuule
- Run IO tests for Dask-cuDF (#14327) @rjzamora
- Fix logical type issues in the Parquet writer (#14322) @vuule
- Remove aws-sdk-pinning and revert to arrow 12.0.1 (#14319) @vyasr
- test is_valid before reading column data (#14318) @etseidl
- Fix gtest validity setting for TextTokenizeTest.Vocabulary (#14312) @davidwendt
- Fixes stack context for json lines format that recovers from invalid JSON lines (#14309) @elstehle
- Downgrade to Arrow 12.0.0 for aws-sdk-cpp and fix cudf_kafka builds for new CI containers (#14296) @vyasr
- fixing thread index overflow issue (#14290) @hyperbolic2346
- Fix memset error in nvtext::edit_distance_matrix (#14283) @davidwendt
- Changes JSON reader's recovery option's behaviour to ignore all characters after a valid JSON record (#14279) @elstehle
- Handle empty string correctly in Parquet statistics (#14257) @etseidl
- Fixes behaviour for incomplete lines when recover_with_nulls is enabled (#14252) @elstehle
- cudf::detail::pinned_allocator doesn't throw from deallocate (#14251) @robertmaynard
- Fix strings replace for adjacent, identical multi-byte UTF-8 character targets (#14235) @davidwendt
- Fix the precision when converting a decimal128 column to an arrow array (#14230) @jihoonson
- Fixing parquet list of struct interpretation (#13715) @hyperbolic2346
📖 Documentation
- Fix io reference in docs. (#14452) @bdice
- Update README (#14374) @shwina
- Example code for blog on new row comparators (#13795) @divyegala
🚀 New Features
- Expose streams in public unary APIs (#14342) @vyasr
- Add python tests for Parquet DELTA_BINARY_PACKED encoder (#14316) @etseidl
- Update rapids-cmake functions to non-deprecated signatures (#14265) @robertmaynard
- Expose streams in public null mask APIs (#14263) @vyasr
- Expose streams in binaryop APIs (#14187) @vyasr
- Add pylibcudf.Scalar that interoperates with Arrow scalars (#14133) @vyasr
- Add decoder for DELTA_BYTE_ARRAY to Parquet reader (#14101) @etseidl
- Add DELTA_BINARY_PACKED encoder for Parquet writer (#14100) @etseidl
- Add BytePairEncoder class to cuDF (#13891) @davidwendt
- Upgrade to nvCOMP 3.0.4 (#13815) @vuule
- Use pynvjitlink for CUDA 12+ MVC (#13650) @brandon-b-miller
🛠️ Improvements
- Build concurrency for nightly and merge triggers (#14441) @bdice
- Cleanup remaining usages of dask dependencies (#14407) @galipremsagar
- Update to Arrow 14.0.1. (#14387) @bdice
- Remove Cython libcpp wrappers (#14382) @vyasr
- Forward-merge branch-23.10 to branch-23.12 (#14372) @bdice
- Upgrade to arrow 14 (#14371) @galipremsagar
- Fix a pytest typo in test_kurt_skew_error (#14368) @galipremsagar
- Use new rapids-dask-dependency metapackage for managing dask versions (#14364) @vyasr
- Change nullable() to has_nulls() in cudf::detail::gather (#14363) @divyegala
- Split up scan_inclusive.cu to improve its compile time (#14358) @davidwendt
- Implement user_datasource_wrapper is_empty() and is_device_read_preferred(). (#14357) @tpn
- Added streams to CSV reader and writer api (#14340) @shrshi
- Upgrade wheels to use arrow 13 (#14339) @vyasr
- Rework nvtext::byte_pair_encoding API (#14337) @davidwendt
- Improve performance of nvtext::tokenize_with_vocabulary for long strings (#14336) @davidwendt
- Upgrade arrow to 13 (#14330) @galipremsagar
- Expose stream parameter in public nvtext replace APIs (#14329) @davidwendt
- Drop pyorc dependency and use pandas/pyarrow instead (#14323) @galipremsagar
- Avoid pyarrow.fs import for local storage (#14321) @rjzamora
- Unpin dask and distributed for 23.12 development (#14320) @galipremsagar
- Expose stream parameter in public nvtext tokenize APIs (#14317) @davidwendt
- Added streams to JSON reader and writer api (#14313) @shrshi
- Minor improvements in source_info (#14308) @vuule
- Forward-merge branch-23.10 to branch-23.12 (#14307) @bdice
- Add stream parameter to Set Operations (Public List APIs) (#14305) @SurajAralihalli
- Expose stream parameter to get_json_object API (#14297) @davidwendt
- Sort dictionary data alphabetically in the ORC writer (#14295) @vuule
- Expose stream parameter in public strings filter APIs (#14293) @davidwendt
- Refactor cudf_kafka to use skbuild (#14292) @jdye64
- Update shared-action-workflows references (#14289) @AyodeAwe
- Register partd encode dispatch in dask_cudf (#14287) @rjzamora
- Update versioning strategy (#14285) @vyasr
- Move and rename byte-pair-encoding source files (#14284) @davidwendt
- Expose stream parameter in public strings combine APIs (#14281) @davidwendt
- Expose stream parameter in public strings contains APIs (#14280) @davidwendt
- Add stream parameter to List Sort and Filter APIs (#14272) @SurajAralihalli
- Use branch-23.12 workflows. (#14271) @bdice
- Refactor LogicalType for Parquet (#14264) @etseidl
- Centralize chunked reading code in the parquet reader to reader_impl_chunking.cu (#14262) @nvdbaranec
- Expose stream parameter in public strings replace APIs (#14261) @davidwendt
- Expose stream parameter in public strings APIs (#14260) @davidwendt
- Cleanup of namespaces in parquet code. (#14259) @nvdbaranec
- Make parquet schema index type consistent (#14256) @hyperbolic2346
- Expose stream parameter in public strings convert APIs (#14255) @davidwendt
- Add in java bindings for DataSource (#14254) @revans2
- Reimplement cudf::merge for nested types without using comparators (#14250) @divyegala
- Add stream parameter to List Manipulation and Operations APIs (#14248) @SurajAralihalli
- Expose stream parameter in public strings split/partition APIs (#14247) @davidwendt
- Improve contains_column by invoking contains_table (#14238) @PointKernel
- Detect and report errors in Parquet header parsing (#14237) @etseidl
- Normalizing offsets iterator (#14234) @davidwendt
- Forward merge 23.10 into 23.12 (#14231) @galipremsagar
- Return error if BOOL8 column-type is used with integers-to-hex (#14208) @davidwendt
- Enable indexalator for device code (#14206) @davidwendt
- Marginally reduce memory footprint of joins (#14197) @wence-
- Add nvtx annotations to spilling-based data movement (#14196) @wence-
- Optimize ORC writer for decimal columns (#14190) @vuule
- Remove the use of volatile in ORC (#14175) @vuule
- Add bytes_per_second to distinct_count of stream_compaction nvbench. (#14172) @Blonck
- Add bytes_per_second to transpose benchmark (#14170) @Blonck
- cuDF: Build CUDA 12.0 ARM conda packages. (#14112) @bdice
- Add bytes_per_second to shift benchmark (#13950) @Blonck
- Extract debug_utilities.hpp/cu from column_utilities.hpp/cu (#13720) @ttnghia
- C++
Published by raydouglass about 2 years ago
https://github.com/rapidsai/cudf - [NIGHTLY] v24.02.00
🔗 Links
🚨 Breaking Changes
- Expunge as_frame conversions in Column algorithms (#14491) @wence-
- Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
- REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
- Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
🐛 Bug Fixes
- Drop llvm16 from cuda118-conda devcontainer image (#14526) @charlesbluca
- REF: Make DataFrame.from_pandas process by column (#14483) @mroeschke
- Improve memory footprint of isin by using contains (#14478) @wence-
- Enable pd.Timestamp objects to be picklable when cudf.pandas is active (#14474) @shwina
- Correct dtype of count aggregations on empty dataframes (#14473) @wence-
- Avoid DataFrame conversion in MultiIndex.from_pandas (#14470) @mroeschke
- JSON writer: avoid default stream use in string_scalar constructors (#14444) @vuule
- Fix default stream use in the CSV reader (#14443) @vuule
- Preserve DataFrame(columns=).columns dtype during empty-like construction (#14381) @mroeschke
📖 Documentation
- Some doxygen improvements (#14469) @vyasr
- Remove warning in dask-cudf docs (#14454) @wence-
- Update README links with redirects. (#14378) @bdice
🚀 New Features
- Make DefaultHostMemoryAllocator settable (#14523) @gerashegalov
🛠️ Improvements
- Split libarrow build dependencies. (#14506) @bdice
- Expunge as_frame conversions in Column algorithms (#14491) @wence-
- Remove unsanitized null from input strings column in rank_tests.cpp (#14475) @davidwendt
- Refactor Parquet kernel_error (#14464) @etseidl
- Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
- Introduce Comprehensive Pathological Unit Tests for Issue #14409 (#14459) @aocsa
- Expose stream parameter in public nvtext APIs (#14456) @davidwendt
- Remove the use of volatile in Parquet (#14448) @vuule
- REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
- Match pandas join ordering obligations in pandas-compatible mode (#14428) @wence-
- Forward-merge branch-23.12 to branch-24.02 (#14426) @bdice
- Forward-merge branch-23.12 to branch-24.02 (#14422) @bdice
- REF: Remove instances of pd.core (#14421) @mroeschke
- Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
- Update to fmt 10.1.1 and spdlog 1.12.0. (#14355) @bdice
- Add cuDF devcontainers (#14015) @trxcllnt
- C++
Published by rapids-bot[bot] over 2 years ago
https://github.com/rapidsai/cudf - v23.10.02
🚨 Breaking Changes
- Raise error in reindex when index is not unique (#14429) @galipremsagar
- Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
- Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
- Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
- Create table_input_metadata from a table_metadata (#13920) @etseidl
- Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
- Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
- Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
- Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
- Remove the libcudf cudf::offset_type type (#13788) @davidwendt
- Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
- Update to Cython 3.0.0 (#13777) @vyasr
- Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
- Enforce deprecations in 23.10 (#13732) @galipremsagar
- Upgrade to arrow 12 (#13728) @galipremsagar
- Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule
🐛 Bug Fixes
- Raise error in reindex when index is not unique (#14429) @galipremsagar
- Fix inaccurate ceil/floor and inaccurate rescaling casts of fixed-point values. (#14242) @bdice
- Fix inaccuracy in decimal128 rounding. (#14233) @bdice
- Workaround for illegal instruction error in sm90 for warp intrinsics with mask (#14201) @karthikeyann
- Fix pytorch related pytest (#14198) @galipremsagar
- Pin to aws-sdk-cpp<1.11 (#14173) @pentschev
- Fix assert failure for range window functions (#14168) @mythrocks
- Fix Memcheck error found in JSON_TEST JsonReaderTest.ErrorStrings (#14164) @karthikeyann
- Fix calls to copy_bitmask to pass stream parameter (#14158) @davidwendt
- Fix DataFrame from Series with different CategoricalIndexes (#14157) @mroeschke
- Pin to numpy<1.25 and numba<0.58 to avoid errors and deprecation warnings-as-errors. (#14156) @bdice
- Fix kernel launch error for cudf::io::orc::gpu::rowgroup_char_counts_kernel (#14139) @davidwendt
- Don't sort columns for DataFrame init from list of Series (#14136) @mroeschke
- Fix DataFrame.values with no columns but index (#14134) @mroeschke
- Avoid circular cimports in _lib/cpp/reduce.pxd (#14125) @vyasr
- Add support for nested dict in
DataFrameconstructor (#14119) @galipremsagar - Restrict iterables of
DataFrame's as input toDataFrameconstructor (#14118) @galipremsagar - Allow
numeric_only=Truefor reduction operations on numeric types (#14111) @galipremsagar - Preserve name of the column while initializing a
DataFrame(#14110) @galipremsagar - Correct numerous 20054-D: dynamic initialization errors found on arm+12.2 (#14108) @robertmaynard
- Drop
kwargsfromSeries.count(#14106) @galipremsagar - Fix naming issues with
Index.to_frameandMultiIndex.to_frameAPIs (#14105) @galipremsagar - Only use memory resources that haven't been freed (#14103) @robertmaynard
- Add support for
__round__inSeriesandDataFrame(#14099) @galipremsagar - Validate ignoreindex type in dropduplicates (#14098) @mroeschke
- Fix renaming
SeriesandIndex(#14080) @galipremsagar - Raise NotImplementedError in to_datetime if Z (or tz component) in string (#14074) @mroeschke
- Raise NotImplementedError for datetime strings with UTC offset (#14070) @mroeschke
- Update pyarrow-related dispatch logic in dask_cudf (#14069) @rjzamora
- Use
conda mambabuildrather thanmamba mambabuild(#14067) @wence- - Raise NotImplementedError in todatetime with dayfirst without inferformat (#14058) @mroeschke
- Fix various issues in
Index.intersection(#14054) @galipremsagar - Fix
Index.differenceto match with pandas (#14053) @galipremsagar - Fix empty string column construction (#14052) @galipremsagar
- Fix
IntervalIndex.unionto preserve type-metadata (#14051) @galipremsagar - Raise
MixedTypeErrorwhen a column of mixed-dtype is being constructed (#14050) @galipremsagar - Raise
NotImplementedErrorforMultiIndex.to_series(#14049) @galipremsagar - Ignore compile_commands.json (#14048) @harrism
- Raise TypeError for any non-parseable argument in to_datetime (#14044) @mroeschke
- Raise NotImplementedError for to_datetime with z format (#14037) @mroeschke
- Implement `sort_remaining` for `sort_index` (#14033) @wence-
- Raise NotImplementedError for Categoricals with timezones (#14032) @mroeschke
- Temporary fix Parquet metadata with empty value string being ignored from writing (#14026) @ttnghia
- Preserve types of scalar being returned when possible in `quantile` (#14014) @galipremsagar
- Fix return type of `MultiIndex.difference` (#14009) @galipremsagar
- Raise an error when timezone subtypes are encountered in `pd.IntervalDtype` (#14006) @galipremsagar
- Fix map column can not be non-nullable for java (#14003) @res-life
- Fix `name` selection in `Index.difference` and `Index.intersection` (#13986) @galipremsagar
- Restore column type metadata with `dropna` to fix `factorize` API (#13980) @galipremsagar
- Use thread_index_type to avoid out of bounds accesses in conditional joins (#13971) @vyasr
- Fix `MultiIndex.to_numpy` to return numpy array with tuples (#13966) @galipremsagar
- Use cudf::thread_index_type in get_json_object and tdigest kernels (#13962) @nvdbaranec
- Fix an issue with `IntervalIndex.repr` when null values are present (#13958) @galipremsagar
- Fix type metadata issue preservation with `Column.unique` (#13957) @galipremsagar
- Handle `Interval` scalars when passed in list-like inputs to `cudf.Index` (#13956) @galipremsagar
- Fix setting of categories order when `dtype` is passed to a `CategoricalColumn` (#13955) @galipremsagar
- Handle `as_index` in `GroupBy.apply` (#13951) @brandon-b-miller
- Raise error for string types in `nsmallest` and `nlargest` (#13946) @galipremsagar
- Fix `index` of `Groupby.apply` results when it is performed on empty objects (#13944) @galipremsagar
- Fix integer overflow in shim `device_sum` functions (#13943) @brandon-b-miller
- Fix type mismatch in groupby reduction for empty objects (#13942) @galipremsagar
- Fixed processed bytes calculation in APPLY_BOOLEAN_MASK benchmark. (#13937) @Blonck
- Fix construction of `Grouping` objects (#13932) @galipremsagar
- Fix an issue with `loc` when column names is `MultiIndex` (#13929) @galipremsagar
- Fix handling of typecasting in `searchsorted` (#13925) @galipremsagar
- Preserve index `name` in `reindex` (#13917) @galipremsagar
- Use `cudf::thread_index_type` in cuIO to prevent overflow in row indexing (#13910) @vuule
- Fix for encodings listed in the Parquet column chunk metadata (#13907) @etseidl
- Use cudf::thread_index_type in concatenate.cu. (#13906) @bdice
- Use cudf::thread_index_type in replace.cu. (#13905) @bdice
- Add noSanitizer tag to Java reduction tests failing with sanitizer in CUDA 12 (#13904) @jlowe
- Remove the internal use of the cudf's default stream in cuIO (#13903) @vuule
- Use cuda-nvtx-dev CUDA 12 package. (#13901) @bdice
- Use `thread_index_type` to avoid index overflow in grid-stride loops (#13895) @PointKernel
- Fix memory access error in cudf::shift for sliced strings (#13894) @davidwendt
- Raise error when trying to construct a `DataFrame` with mixed types (#13889) @galipremsagar
- Return `nan` when one variable to be correlated has zero variance in JIT GroupBy Apply (#13884) @brandon-b-miller
- Correctly detect the BOM mark in `read_csv` with compressed input (#13881) @vuule
- Check for the presence of all values in `MultiIndex.isin` (#13879) @galipremsagar
- Fix nvtext::generate_character_ngrams performance regression for longer strings (#13874) @davidwendt
- Fix return type of `MultiIndex.levels` (#13870) @galipremsagar
- Fix List's missing children metadata in JSON writer (#13869) @karthikeyann
- Disable construction of Index when `freq` is set in pandas-compatibility mode (#13857) @galipremsagar
- Fix an issue with fetching `NA` from a `TimedeltaColumn` (#13853) @galipremsagar
- Simplify implementation of interval_range() and fix behaviour for floating `freq` (#13844) @shwina
- Fix binary operations between `Series` and `Index` (#13842) @galipremsagar
- Update make_lists_column_from_scalar to use make_offsets_child_column utility (#13841) @davidwendt
- Fix read out of bounds in string concatenate (#13838) @pentschev
- Raise error for more cases when `timezone-aware` data is passed to `as_column` (#13835) @galipremsagar
- Fix `any`, `all` reduction behavior for `axis=None` and warn for other reductions (#13831) @galipremsagar
- Raise error when trying to construct time-zone aware timestamps (#13830) @galipremsagar
- Fix cuFile I/O factories (#13829) @vuule
- DataFrame with namedtuples uses ._field as column names (#13824) @mroeschke
- Branch 23.10 merge 23.08 (#13822) @vyasr
- Return a Series from JIT GroupBy apply, rather than a DataFrame (#13820) @brandon-b-miller
- No need to dlsym EnsureS3Finalized we can call it directly (#13819) @robertmaynard
- Raise error when mixed types are being constructed (#13816) @galipremsagar
- Fix unbounded sequence issue in `DataFrame` constructor (#13811) @galipremsagar
- Fix Byte-Pair-Encoding usage of cuco static-map for storing merge-pairs (#13807) @davidwendt
- Fix for Parquet writer when requested pages per row is smaller than fragment size (#13806) @etseidl
- Remove hangs from trying to construct un-bounded sequences (#13799) @galipremsagar
- Bug/update libcudf to handle arrow12 changes (#13794) @robertmaynard
- Update get_arrow to arrows 12 CMake target name of arrow::xsimd (#13790) @robertmaynard
- Raise error when trying to join `datetime` and `timedelta` types with other types (#13786) @galipremsagar
- Fix negative unary operation for boolean type (#13780) @galipremsagar
- Fix contains(`in`) method for `Series` (#13779) @galipremsagar
- Fix binary operation column ordering and missing column issues (#13778) @galipremsagar
- Cast only time of day to nanos to avoid an overflow in Parquet INT96 write (#13776) @gerashegalov
- Preserve names of column object in various APIs (#13772) @galipremsagar
- Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
- Fix construction of DataFrames from dict when columns are provided (#13766) @wence-
- Provide our own Cython declaration for make_unique (#13746) @wence-
📖 Documentation
- Fix benchmark image. (#14376) @bdice
- Fix typo in docstring: metadata. (#14025) @bdice
- Fix typo in parquet/page_decode.cuh (#13849) @XinyuZeng
- Simplify Python doc configuration (#13826) @vyasr
- Update documentation to reflect recent changes in JSON reader and writer (#13791) @vuule
- Fix all warnings in Python docs (#13789) @vyasr
🚀 New Features
- [Java] Add JNI bindings for `integers_to_hex` (#14205) @razajafri
- Propagate errors from Parquet reader kernels back to host (#14167) @vuule
- JNI for `HISTOGRAM` and `MERGE_HISTOGRAM` aggregations (#14154) @ttnghia
- Expose streams in all public sorting APIs (#14146) @vyasr
- Enable direct ingestion and production of Arrow scalars (#14121) @vyasr
- Implement `GroupBy.value_counts` to match pandas API (#14114) @stmio
- Refactor parquet thrift reader (#14097) @etseidl
- Refactor `hash_reduce_by_row` (#14095) @ttnghia
- Support negative preceding/following for ROW window functions (#14093) @mythrocks
- Support for progressive parquet chunked reading. (#14079) @nvdbaranec
- Implement `HISTOGRAM` and `MERGE_HISTOGRAM` aggregations (#14045) @ttnghia
- Expose streams in public search APIs (#14034) @vyasr
- Expose streams in public replace APIs (#14010) @vyasr
- Add stream parameter to public cudf::strings::split APIs (#13997) @davidwendt
- Expose streams in public filling APIs (#13990) @vyasr
- Expose streams in public concatenate APIs (#13987) @vyasr
- Use HostMemoryAllocator in jni::allocate_host_buffer (#13975) @gerashegalov
- Enable fractional null probability for hashing benchmark (#13967) @Blonck
- Switch pylibcudf-enabled types to use enum class in Cython (#13931) @vyasr
- Add nvtext::tokenize_with_vocabulary API (#13930) @davidwendt
- Rewrite `DataFrame.stack` to support multi level column names (#13927) @isVoid
- Add HostMemoryAllocator interface (#13924) @gerashegalov
- Global stream pool (#13922) @etseidl
- Create table_input_metadata from a table_metadata (#13920) @etseidl
- Translate column size overflow exception to JNI (#13911) @mythrocks
- Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
- Exclude some tests from running with the compute sanitizer (#13872) @firestarman
- Expand statistics support in ORC writer (#13848) @vuule
- Register the memory mapped buffer in `datasource` to improve H2D throughput (#13814) @vuule
- Add cudf::strings::find function with target per row (#13808) @davidwendt
- Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
- Remove unnecessary pointer copying in JIT GroupBy Apply (#13792) @brandon-b-miller
- Add 'poll' function to custreamz kafka consumer (#13782) @jdye64
- Support `corr` in `GroupBy.apply` through the jit engine (#13767) @shwina
- Optionally write version 2 page headers in Parquet writer (#13751) @etseidl
- Support more numeric types in `Groupby.apply` with `engine='jit'` (#13729) @brandon-b-miller
- [FEA] Add DELTA_BINARY_PACKED decoding support to Parquet reader (#13637) @etseidl
- Read FIXED_LEN_BYTE_ARRAY as binary in parquet reader (#13437) @PointKernel
🛠️ Improvements
- Update `shared-action-workflows` references (backport from `23.12` to `23.10`) (#14300) @AyodeAwe
- Pin `dask` and `distributed` for `23.10` release (#14225) @galipremsagar
- update rmm tag path (#14195) @AyodeAwe
- Disable `Recently Updated` Check (#14193) @ajschmidt8
- Move cpp/src/hash/hash_allocator.cuh to include/cudf/hashing/detail (#14163) @davidwendt
- Add Parquet reader benchmarks for row selection (#14147) @vuule
- Update image names (#14145) @AyodeAwe
- Support callables in DataFrame.assign (#14142) @wence-
- Reduce memory usage of as_categorical_column (#14138) @wence-
- Replace Python scalar conversions with libcudf (#14124) @vyasr
- Update to clang 16.0.6. (#14120) @bdice
- Fix type of empty `Index` and raise warning in `Series` constructor (#14116) @galipremsagar
- Add stream parameter to external dict APIs (#14115) @SurajAralihalli
- Add fallback matrix for nvcomp. (#14082) @bdice
- [Java] Add recoverWithNull to JSONOptions and pass to Table.readJSON (#14078) @andygrove
- Remove header tests (#14072) @ajschmidt8
- Refactor `contains_table` with cuco::static_set (#14064) @PointKernel
- Remove debug print in a Parquet test (#14063) @vuule
- Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
- Expose stream parameter in public strings find APIs (#14060) @davidwendt
- Update doxygen to 1.9.1 (#14059) @vyasr
- Remove the mr from the base fixture (#14057) @vyasr
- Expose streams in public strings case APIs (#14056) @davidwendt
- Refactor libcudf indexalator to typed normalator (#14043) @davidwendt
- Use cudf::make_empty_column instead of column_view constructor (#14030) @davidwendt
- Remove quadratic runtime due to accessing Frame._dtypes in loop (#14028) @wence-
- Explicitly depend on zlib in conda recipes (#14018) @wence-
- Use grid_stride for stride computations. (#13996) @bdice
- Fix an issue where casting null-array to `object` dtype will result in a failure (#13994) @galipremsagar
- Add tab as literal to cudf::test::to_string output (#13993) @davidwendt
- Enable `codes` dtype parity in pandas-compatibility mode for `factorize` API (#13982) @galipremsagar
- Fix `CategoricalIndex` ordering in `Groupby.agg` when pandas-compatibility mode is enabled (#13978) @galipremsagar
- Produce a fatal error if cudf is unable to find pyarrow include directory (#13976) @cwharris
- Use `thread_index_type` in `partitioning.cu` (#13973) @divyegala
- Use `cudf::thread_index_type` in `merge.cu` (#13972) @divyegala
- Use `copy-pr-bot` (#13970) @ajschmidt8
- Use cudf::thread_index_type in strings custom kernels (#13968) @davidwendt
- Add `bytes_per_second` to hash_partition benchmark (#13965) @Blonck
- Added pinned pool reservation API for java (#13964) @revans2
- Simplify wheel build scripts and allow alphas of RAPIDS dependencies (#13963) @vyasr
- Add `bytes_per_second` to copy_if_else benchmark (#13960) @Blonck
- Add pandas compatible output to `Series.unique` (#13959) @galipremsagar
- Add `bytes_per_second` to compiled binaryop benchmark (#13938) @Blonck
- Unpin `dask` and `distributed` for `23.10` development (#13935) @galipremsagar
- Make HostColumnVector.getRefCount public (#13934) @abellina
- Use cuco::static_set in JSON tree algorithm (#13928) @karthikeyann
- Add java API to get size of host memory needed to copy column view (#13919) @revans2
- Use cudf::size_type instead of int32 where appropriate in nvtext functions (#13915) @davidwendt
- Enable hugepage for arrow host allocations (#13914) @madsbk
- Improve performance of nvtext::edit_distance (#13912) @davidwendt
- Ensure cudf internals use pylibcudf in pure Python mode (#13909) @vyasr
- Use `empty()` instead of `size()` where possible (#13908) @vuule
- [JNI] Adds HostColumnVector.EventHandler for spillability checks (#13898) @abellina
- Return `Timestamp` & `Timedelta` for fetching scalars in `DatetimeIndex` & `TimedeltaIndex` (#13896) @galipremsagar
- Allow explicit `shuffle="p2p"` within dask-cudf API (#13893) @rjzamora
- Disable creation of `DatetimeIndex` when `freq` is passed to `cudf.date_range` (#13890) @galipremsagar
- Bring parity with pandas for `datetime` & `timedelta` comparison operations (#13877) @galipremsagar
- Change `NA` to `NaT` for `datetime` and `timedelta` types (#13868) @galipremsagar
- Raise error when `astype(object)` is called in pandas compatibility mode (#13862) @galipremsagar
- Fixes a performance regression in FST (#13850) @elstehle
- Set native handles to null on close in Java wrapper classes (#13818) @jlowe
- Avoid use of CUDF_EXPECTS in libcudf unit tests outside of helper functions with return values (#13812) @vuule
- Update `lists::contains` to experimental row comparator (#13810) @divyegala
- Reduce `lists::contains` dispatches for scalars (#13805) @divyegala
- Long string optimization for string column parsing in JSON reader (#13803) @karthikeyann
- Raise NotImplementedError for pd.SparseDtype (#13798) @mroeschke
- Remove the libcudf cudf::offset_type type (#13788) @davidwendt
- Move Spark-independent Table debug to cudf Java (#13783) @gerashegalov
- Update to Cython 3.0.0 (#13777) @vyasr
- Refactor Parquet reader handling of V2 page header info (#13775) @etseidl
- Branch 23.10 merge 23.08 (#13773) @vyasr
- Restructure JSON code to correctly reflect legacy/experimental status (#13757) @vuule
- Branch 23.10 merge 23.08 (#13753) @vyasr
- Enforce deprecations in `23.10` (#13732) @galipremsagar
- Upgrade to arrow 12 (#13728) @galipremsagar
- Refactors JSON reader's pushdown automaton (#13716) @elstehle
- Remove Arrow dependency from the `datasource.hpp` public header (#13698) @vuule
Published by raydouglass over 2 years ago
https://github.com/rapidsai/cudf - v23.04.01
🚨 Breaking Changes
- Pin `dask` and `distributed` for release (#13070) @galipremsagar
- Declare a different name for nan_equality.UNEQUAL to prevent Cython warnings. (#12947) @bdice
- Update minimum `pandas` and `numpy` pinnings (#12887) @galipremsagar
- Deprecate `names` & `dtype` in `Index.copy` (#12825) @galipremsagar
- Deprecate `Index.is_*` methods (#12820) @galipremsagar
- Deprecate `datetime_is_numeric` from `describe` (#12818) @galipremsagar
- Deprecate `na_sentinel` in `factorize` (#12817) @galipremsagar
- Make string methods return a Series with a useful Index (#12814) @shwina
- Produce useful guidance on overflow error in `to_csv` (#12705) @wence-
- Move `strings_udf` code into cuDF (#12669) @brandon-b-miller
- Remove cudf::strings::repeat_strings_output_sizes and optional parameter from cudf::strings::repeat_strings (#12609) @davidwendt
- Replace message parsing with throwing more specific exceptions (#12426) @vyasr
🐛 Bug Fixes
- Pin curand version (#13127) @vyasr
- Fix memcheck script to execute only _TEST files found in bin/gtests/libcudf (#13006) @davidwendt
- Fix `DataFrame` constructor to broadcast scalar inputs properly (#12997) @galipremsagar
- Drop `force_nullable_schema` from chunked parquet writer (#12996) @galipremsagar
- Fix gtest column utility comparator diff reporting (#12995) @davidwendt
- Handle index names while performing `groupby` (#12992) @galipremsagar
- Fix `__setitem__` on string columns when the scalar value ends in a null byte (#12991) @wence-
- Fix `sort_values` when column is all empty strings (#12988) @eriknw
- Remove unused variable and fix memory issue in ORC writer (#12984) @ttnghia
- Pre-emptive fix for upstream `dask.dataframe.read_parquet` changes (#12983) @rjzamora
- Remove MANIFEST.in use auto-generated one for sdists and package_data for wheels (#12960) @vyasr
- Update to use rapids-export(COMPONENTS) feature. (#12959) @robertmaynard
- cudftestutil supports static gtest dependencies (#12957) @robertmaynard
- Include gtest in build environment. (#12956) @vyasr
- Correctly handle scalar indices in `Index.__getitem__` (#12955) @wence-
- Avoid building cython twice (#12945) @galipremsagar
- Fix set index error for Series rolling window operations (#12942) @galipremsagar
- Fix calculation of null counts for Parquet statistics (#12938) @etseidl
- Preserve integer dtype of hive-partitioned column containing nulls (#12930) @rjzamora
- Use get_current_device_resource for intermediate allocations in COLLECT_LIST window code (#12927) @karthikeyann
- Mark dlpack tensor deleter as noexcept to match PyCapsule_Destructor signature. (#12921) @bdice
- Fix conda recipe post-link.sh typo (#12916) @pentschev
- min_rows and num_rows are swapped in ComputePageSizes declaration in Parquet reader (#12886) @etseidl
- Expect cupy to now support bool arrays for dlpack. (#12883) @vyasr
- Use python -m pytest for nightly wheel tests (#12871) @bdice
- Parquet writer column_size() should return a size_t (#12870) @etseidl
- Fix cudf::hash_partition kernel launch error with decimal128 types (#12863) @davidwendt
- Fix an issue with parquet chunked reader undercounting string lengths. (#12859) @nvdbaranec
- Remove tokenizers pre-install pinning. (#12854) @vyasr
- Fix parquet `RangeIndex` bug (#12838) @rjzamora
- Remove KAFKA_HOST_TEST from compute-sanitizer check (#12831) @davidwendt
- Make string methods return a Series with a useful Index (#12814) @shwina
- Tell cudf_kafka to use header-only fmt (#12796) @vyasr
- Add `GroupBy.dtypes` (#12783) @galipremsagar
- Fix a leak in a test and clarify some test names (#12781) @revans2
- Fix bug in all-null list due to join_list_elements special handling (#12767) @karthikeyann
- Add try/except for expected null-schema error in read_parquet (#12756) @rjzamora
- Throw an exception if an unsupported page encoding is detected in Parquet reader (#12754) @etseidl
- Fix a bug with `num_keys` in `_scatter_by_slice` (#12749) @thomcom
- Bump pinned rapids wheel deps to 23.4 (#12735) @sevagh
- Rework logic in cudf::strings::split_record to improve performance (#12729) @davidwendt
- Add `always_nullable` flag to Dremel encoding (#12727) @divyegala
- Fix memcheck read error in compound segmented reduce (#12722) @davidwendt
- Fix faulty conditional logic in JIT `GroupBy.apply` (#12706) @brandon-b-miller
- Produce useful guidance on overflow error in `to_csv` (#12705) @wence-
- Handle parquet list data corner case (#12698) @nvdbaranec
- Fix missing trailing comma in json writer (#12688) @karthikeyann
- Remove child from newCudaAsyncMemoryResource (#12681) @abellina
- Handle bool types in `round` API (#12670) @galipremsagar
- Ensure all of device bitmask is initialized in from_arrow (#12668) @wence-
- Fix `from_arrow` to load a sliced arrow table (#12665) @galipremsagar
- Fix dask-cudf read_parquet bug for multi-file aggregation (#12663) @rjzamora
- Fix AllocateLikeTest gtests reading uninitialized null-mask (#12643) @davidwendt
- Fix `find_common_dtype` and `values` to handle complex dtypes (#12537) @galipremsagar
- Fix fetching of MultiIndex values when a label is passed (#12521) @galipremsagar
- Fix `Series` comparison vs scalars (#12519) @brandon-b-miller
- Allow casting from `UDFString` back to `StringView` to call methods in `strings_udf` (#12363) @brandon-b-miller
📖 Documentation
- Fix `GroupBy.apply` doc examples rendering (#12994) @brandon-b-miller
- add sphinx building and s3 uploading for dask-cudf docs (#12982) @quasiben
- Add developer documentation forbidding default parameters in detail APIs (#12978) @vyasr
- Add README symlink for dask-cudf. (#12946) @bdice
- Remove return type from @return doxygen tags (#12908) @davidwendt
- Fix docs build to be `pydata-sphinx-theme=0.13.0` compatible (#12874) @galipremsagar
- Add skeleton API and prose documentation for dask-cudf (#12725) @wence-
- Enable doctests for GroupBy methods (#12658) @brandon-b-miller
- Add comment about CUB patch for SegmentedSortInt.Bool gtest (#12611) @davidwendt
🚀 New Features
- Add JNI method for strings::replace multi variety (#12979) @NVnavkumar
- Add nunique aggregation support for cudf::segmented_reduce (#12972) @davidwendt
- Refactor orc chunked writer (#12949) @ttnghia
- Make Parquet writer `nullable` option application to single table writes (#12933) @vuule
- Refactor `io::orc::ProtobufWriter` (#12877) @ttnghia
- Make timezone table independent from ORC (#12805) @vuule
- Cache JIT `GroupBy.apply` functions (#12802) @brandon-b-miller
- Implement initial support for avro logical types (#6482) (#12788) @tpn
- Update `tests/column_utilities` to use `experimental::equality` row comparator (#12777) @divyegala
- Update `distinct/unique_count` to `experimental::row` hasher/comparator (#12776) @divyegala
- Update `hash_partition` to use `experimental::row::row_hasher` (#12761) @divyegala
- Update `is_sorted` to use `experimental::row::lexicographic` (#12752) @divyegala
- Update default data source in cuio reader benchmarks (#12740) @PointKernel
- Reenable stream identification library in CI (#12714) @vyasr
- Add `regex_program` strings splitting java APIs and tests (#12713) @cindyyuanjiang
- Add `regex_program` strings replacing java APIs and tests (#12701) @cindyyuanjiang
- Add `regex_program` strings extract java APIs and tests (#12699) @cindyyuanjiang
- Variable fragment sizes for Parquet writer (#12685) @etseidl
- Add segmented reduction support for fixed-point types (#12680) @davidwendt
- Move `strings_udf` code into cuDF (#12669) @brandon-b-miller
- Add `regex_program` searching APIs and related java classes (#12666) @cindyyuanjiang
- Add logging to libcudf (#12637) @vuule
- Add compound aggregations to cudf::segmented_reduce (#12573) @davidwendt
- Convert `rank` to use to experimental row comparators (#12481) @divyegala
- Use rapids-cmake parallel testing feature (#12451) @robertmaynard
- Enable detection of undesired stream usage (#12089) @vyasr
🛠️ Improvements
- Pin `dask` and `distributed` for release (#13070) @galipremsagar
- Pin cupy in wheel tests to supported versions (#13041) @vyasr
- Pin numba version (#13001) @vyasr
- Rework gtests SequenceTest to remove using namespace cudf (#12985) @davidwendt
- Stop setting package version attribute in wheels (#12977) @vyasr
- Move detail reduction functions to cudf::reduction::detail namespace (#12971) @davidwendt
- Remove default detail mrs: part7 (#12970) @vyasr
- Remove default detail mrs: part6 (#12969) @vyasr
- Remove default detail mrs: part5 (#12968) @vyasr
- Remove default detail mrs: part4 (#12967) @vyasr
- Remove default detail mrs: part3 (#12966) @vyasr
- Remove default detail mrs: part2 (#12965) @vyasr
- Remove default detail mrs: part1 (#12964) @vyasr
- Add `force_nullable_schema` parameter to Parquet writer. (#12952) @galipremsagar
- Declare a different name for nan_equality.UNEQUAL to prevent Cython warnings. (#12947) @bdice
- Remove remaining default stream parameters (#12943) @vyasr
- Fix cudf::segmented_reduce gtest for ANY aggregation (#12940) @davidwendt
- Implement `groupby.head` and `groupby.tail` (#12939) @wence-
- Fix libcudf gtests to pass null-count=0 for empty validity masks (#12923) @davidwendt
- Migrate parquet encoding to use experimental row operators (#12918) @PointKernel
- Fix benchmarks coded in namespace cudf and using namespace cudf (#12915) @karthikeyann
- Fix io/text gtests coded in namespace cudf::test (#12914) @karthikeyann
- Pass `SCCACHE_S3_USE_SSL` to conda builds (#12910) @ajschmidt8
- Fix FST, JSON gtests & benchmarks coded in namespace cudf::test (#12907) @karthikeyann
- Generate pyproject dependencies using dfg (#12906) @vyasr
- Update libcudf counting functions to specify cudf::size_type (#12904) @davidwendt
- Fix `moto` env vars & pass `AWS_SESSION_TOKEN` to conda builds (#12902) @ajschmidt8
- Rewrite CSV writer benchmark with nvbench (#12901) @PointKernel
- Rework some code logic to reduce iterator and comparator inlining to improve compile time (#12900) @davidwendt
- Deprecate `line_terminator` in favor of `lineterminator` in `to_csv` (#12896) @wence-
- Add `stream` and `mr` parameters for `structs::detail::flatten_nested_columns` (#12892) @ttnghia
- Deprecate libcudf regex APIs accepting pattern strings directly (#12891) @davidwendt
- Remove default parameters from detail headers in include (#12888) @vyasr
- Update minimum `pandas` and `numpy` pinnings (#12887) @galipremsagar
- Implement `groupby.sample` (#12882) @wence-
- Update JNI build ENV default to gcc 11 (#12881) @pxLi
- Change return type of `cudf::structs::detail::flatten_nested_columns` to smart pointer (#12878) @ttnghia
- Fix passing seed parameter to MurmurHash3_32 in cudf::hash() function (#12875) @davidwendt
- Remove manual artifact upload step in CI (#12869) @ajschmidt8
- Update to GCC 11 (#12868) @bdice
- Fix null hive-partition behavior in dask-cudf parquet (#12866) @rjzamora
- Update to protobuf>=4.21.6,<4.22. (#12864) @bdice
- Update RMM allocators (#12861) @pentschev
- Improve performance for replace-multi for long strings (#12858) @davidwendt
- Drop Python 3.7 handling for pickle protocol 4 (#12857) @jakirkham
- Migrate as much as possible to pyproject.toml (#12850) @vyasr
- Enable nbqa pre-commit hooks for isort and black. (#12848) @bdice
- Setting a threshold for KvikIO IO (#12841) @madsbk
- Update datasets download URL (#12840) @jjacobelli
- Make docs builds less verbose (#12836) @AyodeAwe
- Consolidate linter configs into pyproject.toml (#12834) @vyasr
- Deprecate `names` & `dtype` in `Index.copy` (#12825) @galipremsagar
- Deprecate `inplace` parameters in categorical methods (#12824) @galipremsagar
- Add optional text file support to ninja-log utility (#12823) @davidwendt
- Deprecate `Index.is_*` methods (#12820) @galipremsagar
- Add dfg as a pre-commit hook (#12819) @vyasr
- Deprecate `datetime_is_numeric` from `describe` (#12818) @galipremsagar
- Deprecate `na_sentinel` in `factorize` (#12817) @galipremsagar
- Shuffling read into a sub function in parquet read (#12809) @hyperbolic2346
- Fixing parquet coalescing of reads (#12808) @hyperbolic2346
- CI: Remove specification of manual stage for check_style.sh script. (#12803) @csadorf
- Add compute-sanitizer github workflow action to nightly tests (#12800) @davidwendt
- Enable groupby std and variance aggregation types in libcudf Debug build (#12799) @davidwendt
- Expose seed argument to hash_values (#12795) @ayushdg
- Fix groupby gtests coded in namespace cudf::test (#12784) @davidwendt
- Improve performance for cudf::strings::count_characters for long strings (#12779) @davidwendt
- Deallocate encoded data in ORC writer immediately after compression (#12770) @vuule
- Stop force pulling fmt in nvbench. (#12768) @vyasr
- Remove now redundant cuda initialization (#12758) @vyasr
- Adds JSON reader, writer io benchmark (#12753) @karthikeyann
- Use test paths relative to package directory. (#12751) @bdice
- Add build metrics report as artifact to cpp-build workflow (#12750) @davidwendt
- Add JNI methods for detecting and purging non-empty nulls from LIST and STRUCT (#12742) @razajafri
- Stop using versioneer to manage versions (#12741) @vyasr
- Reduce error handling verbosity in CI tests scripts (#12738) @AjayThorve
- Reduce the number of test cases in multibyte_split benchmark (#12737) @PointKernel
- Update shared workflow branches (#12733) @ajschmidt8
- JNI switches to nested JSON reader (#12732) @res-life
- Changing `cudf::io::source_info` to use `cudf::host_span<std::byte>` in a non-breaking form (#12730) @hyperbolic2346
- Add nvbench environment class for initializing RMM in benchmarks (#12728) @davidwendt
- Split C++ and Python build dependencies into separate lists. (#12724) @bdice
- Add build dependencies to Java tests. (#12723) @bdice
- Allow setting the seed argument for hash partition (#12715) @firestarman
- Remove gpuCI scripts. (#12712) @bdice
- Unpin `dask` and `distributed` for development (#12710) @galipremsagar
- `partition_by_hash()`: use `_split()` (#12704) @madsbk
- Remove DataFrame.quantiles from docs. (#12684) @bdice
- Fast path for `experimental::row::equality` (#12676) @divyegala
- Move date to build string in `conda` recipe (#12661) @ajschmidt8
- Refactor reduction logic for fixed-point types (#12652) @davidwendt
- Pay off some JNI RMM API tech debt (#12632) @revans2
- Merge `copy-on-write` feature branch into `branch-23.04` (#12619) @galipremsagar
- Remove cudf::strings::repeat_strings_output_sizes and optional parameter from cudf::strings::repeat_strings (#12609) @davidwendt
- Pin cuda-nvrtc. (#12606) @bdice
- Remove cudf::test::print calls from libcudf gtests (#12604) @davidwendt
- Init JNI version 23.04.0-SNAPSHOT (#12599) @pxLi
- Add performance benchmarks to user facing docs (#12595) @galipremsagar
- Add docs build job (#12592) @AyodeAwe
- Replace message parsing with throwing more specific exceptions (#12426) @vyasr
- Support conversion to/from cudf in dask.dataframe.core.to_backend (#12380) @rjzamora
Published by raydouglass over 2 years ago
https://github.com/rapidsai/cudf - [NIGHTLY] v23.10.00
🚨 Breaking Changes
- Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
- Raise `MixedTypeError` when a column of mixed-dtype is being constructed (#14050) @galipremsagar
- Raise `NotImplementedError` for `MultiIndex.to_series` (#14049) @galipremsagar
- Create table_input_metadata from a table_metadata (#13920) @etseidl
- Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
- Change `NA` to `NaT` for `datetime` and `timedelta` types (#13868) @galipremsagar
- Fix `any`, `all` reduction behavior for `axis=None` and warn for other reductions (#13831) @galipremsagar
- Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
- Remove the libcudf cudf::offset_type type (#13788) @davidwendt
- Raise error when trying to join `datetime` and `timedelta` types with other types (#13786) @galipremsagar
- Update to Cython 3.0.0 (#13777) @vyasr
- Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
- Enforce deprecations in `23.10` (#13732) @galipremsagar
- Upgrade to arrow 12 (#13728) @galipremsagar
- Remove Arrow dependency from the `datasource.hpp` public header (#13698) @vuule
🐛 Bug Fixes
- Fix inaccurate ceil/floor and inaccurate rescaling casts of fixed-point values. (#14242) @bdice
- Fix inaccuracy in decimal128 rounding. (#14233) @bdice
- Workaround for illegal instruction error in sm90 for warp intrinsics with mask (#14201) @karthikeyann
- Fix pytorch related pytest (#14198) @galipremsagar
- Pin to `aws-sdk-cpp<1.11` (#14173) @pentschev
- Fix assert failure for range window functions (#14168) @mythrocks
- Fix Memcheck error found in JSON_TEST JsonReaderTest.ErrorStrings (#14164) @karthikeyann
- Fix calls to copy_bitmask to pass stream parameter (#14158) @davidwendt
- Fix DataFrame from Series with different CategoricalIndexes (#14157) @mroeschke
- Pin to numpy<1.25 and numba<0.58 to avoid errors and deprecation warnings-as-errors. (#14156) @bdice
- Fix kernel launch error for cudf::io::orc::gpu::rowgroup_char_counts_kernel (#14139) @davidwendt
- Don't sort columns for DataFrame init from list of Series (#14136) @mroeschke
- Fix DataFrame.values with no columns but index (#14134) @mroeschke
- Avoid circular cimports in _lib/cpp/reduce.pxd (#14125) @vyasr
- Add support for nested dict in `DataFrame` constructor (#14119) @galipremsagar
- Restrict iterables of `DataFrame`'s as input to `DataFrame` constructor (#14118) @galipremsagar
- Allow `numeric_only=True` for reduction operations on numeric types (#14111) @galipremsagar
- Preserve name of the column while initializing a `DataFrame` (#14110) @galipremsagar
- Correct numerous 20054-D: dynamic initialization errors found on arm+12.2 (#14108) @robertmaynard
- Drop `kwargs` from `Series.count` (#14106) @galipremsagar
- Fix naming issues with `Index.to_frame` and `MultiIndex.to_frame` APIs (#14105) @galipremsagar
- Only use memory resources that haven't been freed (#14103) @robertmaynard
- Add support for `__round__` in `Series` and `DataFrame` (#14099) @galipremsagar
- Validate ignore_index type in drop_duplicates (#14098) @mroeschke
- Fix renaming `Series` and `Index` (#14080) @galipremsagar
- Raise NotImplementedError in to_datetime if Z (or tz component) in string (#14074) @mroeschke
- Raise NotImplementedError for datetime strings with UTC offset (#14070) @mroeschke
- Update pyarrow-related dispatch logic in dask_cudf (#14069) @rjzamora
- Use
conda mambabuildrather thanmamba mambabuild(#14067) @wence- - Raise NotImplementedError in todatetime with dayfirst without inferformat (#14058) @mroeschke
- Fix various issues in
Index.intersection(#14054) @galipremsagar - Fix
Index.differenceto match with pandas (#14053) @galipremsagar - Fix empty string column construction (#14052) @galipremsagar
- Fix
IntervalIndex.unionto preserve type-metadata (#14051) @galipremsagar - Raise
MixedTypeErrorwhen a column of mixed-dtype is being constructed (#14050) @galipremsagar - Raise
NotImplementedErrorforMultiIndex.to_series(#14049) @galipremsagar - Ignore compile_commands.json (#14048) @harrism
- Raise TypeError for any non-parseable argument in to_datetime (#14044) @mroeschke
- Raise NotImplementedError for to_datetime with z format (#14037) @mroeschke
- Implement `sort_remaining` for `sort_index` (#14033) @wence-
- Raise NotImplementedError for Categoricals with timezones (#14032) @mroeschke
- Temporary fix Parquet metadata with empty value string being ignored from writing (#14026) @ttnghia
- Preserve types of scalar being returned when possible in `quantile` (#14014) @galipremsagar
- Fix return type of `MultiIndex.difference` (#14009) @galipremsagar
- Raise an error when timezone subtypes are encountered in `pd.IntervalDtype` (#14006) @galipremsagar
- Fix map column can not be non-nullable for java (#14003) @res-life
- Fix `name` selection in `Index.difference` and `Index.intersection` (#13986) @galipremsagar
- Restore column type metadata with `dropna` to fix `factorize` API (#13980) @galipremsagar
- Use thread_index_type to avoid out of bounds accesses in conditional joins (#13971) @vyasr
- Fix `MultiIndex.to_numpy` to return numpy array with tuples (#13966) @galipremsagar
- Use cudf::thread_index_type in get_json_object and tdigest kernels (#13962) @nvdbaranec
- Fix an issue with `IntervalIndex.repr` when null values are present (#13958) @galipremsagar
- Fix type metadata issue preservation with `Column.unique` (#13957) @galipremsagar
- Handle `Interval` scalars when passed in list-like inputs to `cudf.Index` (#13956) @galipremsagar
- Fix setting of categories order when `dtype` is passed to a `CategoricalColumn` (#13955) @galipremsagar
- Handle `as_index` in `GroupBy.apply` (#13951) @brandon-b-miller
- Raise error for string types in `nsmallest` and `nlargest` (#13946) @galipremsagar
- Fix `index` of `Groupby.apply` results when it is performed on empty objects (#13944) @galipremsagar
- Fix integer overflow in shim `device_sum` functions (#13943) @brandon-b-miller
- Fix type mismatch in groupby reduction for empty objects (#13942) @galipremsagar
- Fixed processed bytes calculation in APPLY_BOOLEAN_MASK benchmark. (#13937) @Blonck
- Fix construction of `Grouping` objects (#13932) @galipremsagar
- Fix an issue with `loc` when column names is `MultiIndex` (#13929) @galipremsagar
- Fix handling of typecasting in `searchsorted` (#13925) @galipremsagar
- Preserve index `name` in `reindex` (#13917) @galipremsagar
- Use `cudf::thread_index_type` in cuIO to prevent overflow in row indexing (#13910) @vuule
- Fix for encodings listed in the Parquet column chunk metadata (#13907) @etseidl
- Use cudf::thread_index_type in concatenate.cu. (#13906) @bdice
- Use cudf::thread_index_type in replace.cu. (#13905) @bdice
- Add noSanitizer tag to Java reduction tests failing with sanitizer in CUDA 12 (#13904) @jlowe
- Remove the internal use of the cudf's default stream in cuIO (#13903) @vuule
- Use cuda-nvtx-dev CUDA 12 package. (#13901) @bdice
- Use `thread_index_type` to avoid index overflow in grid-stride loops (#13895) @PointKernel
- Fix memory access error in cudf::shift for sliced strings (#13894) @davidwendt
- Raise error when trying to construct a `DataFrame` with mixed types (#13889) @galipremsagar
- Return `nan` when one variable to be correlated has zero variance in JIT GroupBy Apply (#13884) @brandon-b-miller
- Correctly detect the BOM mark in `read_csv` with compressed input (#13881) @vuule
- Check for the presence of all values in `MultiIndex.isin` (#13879) @galipremsagar
- Fix nvtext::generate_character_ngrams performance regression for longer strings (#13874) @davidwendt
- Fix return type of `MultiIndex.levels` (#13870) @galipremsagar
- Fix List's missing children metadata in JSON writer (#13869) @karthikeyann
- Disable construction of Index when `freq` is set in pandas-compatibility mode (#13857) @galipremsagar
- Fix an issue with fetching `NA` from a `TimedeltaColumn` (#13853) @galipremsagar
- Simplify implementation of interval_range() and fix behaviour for floating `freq` (#13844) @shwina
- Fix binary operations between `Series` and `Index` (#13842) @galipremsagar
- Update make_lists_column_from_scalar to use make_offsets_child_column utility (#13841) @davidwendt
- Fix read out of bounds in string concatenate (#13838) @pentschev
- Raise error for more cases when `timezone-aware` data is passed to `as_column` (#13835) @galipremsagar
- Fix `any`, `all` reduction behavior for `axis=None` and warn for other reductions (#13831) @galipremsagar
- Raise error when trying to construct time-zone aware timestamps (#13830) @galipremsagar
- Fix cuFile I/O factories (#13829) @vuule
- DataFrame with namedtuples uses ._field as column names (#13824) @mroeschke
- Branch 23.10 merge 23.08 (#13822) @vyasr
- Return a Series from JIT GroupBy apply, rather than a DataFrame (#13820) @brandon-b-miller
- No need to dlsym EnsureS3Finalized we can call it directly (#13819) @robertmaynard
- Raise error when mixed types are being constructed (#13816) @galipremsagar
- Fix unbounded sequence issue in `DataFrame` constructor (#13811) @galipremsagar
- Fix Byte-Pair-Encoding usage of cuco static-map for storing merge-pairs (#13807) @davidwendt
- Fix for Parquet writer when requested pages per row is smaller than fragment size (#13806) @etseidl
- Remove hangs from trying to construct un-bounded sequences (#13799) @galipremsagar
- Bug/update libcudf to handle arrow12 changes (#13794) @robertmaynard
- Update get_arrow to arrows 12 CMake target name of arrow::xsimd (#13790) @robertmaynard
- Raise error when trying to join `datetime` and `timedelta` types with other types (#13786) @galipremsagar
- Fix negative unary operation for boolean type (#13780) @galipremsagar
- Fix __contains__(`in`) method for `Series` (#13779) @galipremsagar
- Fix binary operation column ordering and missing column issues (#13778) @galipremsagar
- Cast only time of day to nanos to avoid an overflow in Parquet INT96 write (#13776) @gerashegalov
- Preserve names of column object in various APIs (#13772) @galipremsagar
- Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
- Fix construction of DataFrames from dict when columns are provided (#13766) @wence-
- Provide our own Cython declaration for make_unique (#13746) @wence-
📖 Documentation
- Fix typo in docstring: metadata. (#14025) @bdice
- Fix typo in parquet/page_decode.cuh (#13849) @XinyuZeng
- Simplify Python doc configuration (#13826) @vyasr
- Update documentation to reflect recent changes in JSON reader and writer (#13791) @vuule
- Fix all warnings in Python docs (#13789) @vyasr
🚀 New Features
- [Java] Add JNI bindings for `integers_to_hex` (#14205) @razajafri
- Propagate errors from Parquet reader kernels back to host (#14167) @vuule
- JNI for `HISTOGRAM` and `MERGE_HISTOGRAM` aggregations (#14154) @ttnghia
- Expose streams in all public sorting APIs (#14146) @vyasr
- Enable direct ingestion and production of Arrow scalars (#14121) @vyasr
- Implement `GroupBy.value_counts` to match pandas API (#14114) @stmio
- Refactor parquet thrift reader (#14097) @etseidl
- Refactor `hash_reduce_by_row` (#14095) @ttnghia
- Support negative preceding/following for ROW window functions (#14093) @mythrocks
- Support for progressive parquet chunked reading. (#14079) @nvdbaranec
- Implement `HISTOGRAM` and `MERGE_HISTOGRAM` aggregations (#14045) @ttnghia
- Expose streams in public search APIs (#14034) @vyasr
- Expose streams in public replace APIs (#14010) @vyasr
- Add stream parameter to public cudf::strings::split APIs (#13997) @davidwendt
- Expose streams in public filling APIs (#13990) @vyasr
- Expose streams in public concatenate APIs (#13987) @vyasr
- Use HostMemoryAllocator in jni::allocate_host_buffer (#13975) @gerashegalov
- Enable fractional null probability for hashing benchmark (#13967) @Blonck
- Switch pylibcudf-enabled types to use enum class in Cython (#13931) @vyasr
- Add nvtext::tokenize_with_vocabulary API (#13930) @davidwendt
- Rewrite `DataFrame.stack` to support multi level column names (#13927) @isVoid
- Add HostMemoryAllocator interface (#13924) @gerashegalov
- Global stream pool (#13922) @etseidl
- Create table_input_metadata from a table_metadata (#13920) @etseidl
- Translate column size overflow exception to JNI (#13911) @mythrocks
- Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
- Exclude some tests from running with the compute sanitizer (#13872) @firestarman
- Expand statistics support in ORC writer (#13848) @vuule
- Register the memory mapped buffer in `datasource` to improve H2D throughput (#13814) @vuule
- Add cudf::strings::find function with target per row (#13808) @davidwendt
- Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
- Remove unnecessary pointer copying in JIT GroupBy Apply (#13792) @brandon-b-miller
- Add 'poll' function to custreamz kafka consumer (#13782) @jdye64
- Support `corr` in `GroupBy.apply` through the jit engine (#13767) @shwina
- Optionally write version 2 page headers in Parquet writer (#13751) @etseidl
- Support more numeric types in `Groupby.apply` with `engine='jit'` (#13729) @brandon-b-miller
- [FEA] Add DELTA_BINARY_PACKED decoding support to Parquet reader (#13637) @etseidl
- Read FIXED_LEN_BYTE_ARRAY as binary in parquet reader (#13437) @PointKernel
🛠️ Improvements
- Update `shared-action-workflows` references (backport from `23.12` to `23.10`) (#14300) @AyodeAwe
- Pin `dask` and `distributed` for `23.10` release (#14225) @galipremsagar
- update rmm tag path (#14195) @AyodeAwe
- Disable `Recently Updated` Check (#14193) @ajschmidt8
- Move cpp/src/hash/hash_allocator.cuh to include/cudf/hashing/detail (#14163) @davidwendt
- Add Parquet reader benchmarks for row selection (#14147) @vuule
- Update image names (#14145) @AyodeAwe
- Support callables in DataFrame.assign (#14142) @wence-
- Reduce memory usage of as_categorical_column (#14138) @wence-
- Replace Python scalar conversions with libcudf (#14124) @vyasr
- Update to clang 16.0.6. (#14120) @bdice
- Fix type of empty `Index` and raise warning in `Series` constructor (#14116) @galipremsagar
- Add stream parameter to external dict APIs (#14115) @SurajAralihalli
- Add fallback matrix for nvcomp. (#14082) @bdice
- [Java] Add recoverWithNull to JSONOptions and pass to Table.readJSON (#14078) @andygrove
- Remove header tests (#14072) @ajschmidt8
- Refactor `contains_table` with cuco::static_set (#14064) @PointKernel
- Remove debug print in a Parquet test (#14063) @vuule
- Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
- Expose stream parameter in public strings find APIs (#14060) @davidwendt
- Update doxygen to 1.9.1 (#14059) @vyasr
- Remove the mr from the base fixture (#14057) @vyasr
- Expose streams in public strings case APIs (#14056) @davidwendt
- Refactor libcudf indexalator to typed normalator (#14043) @davidwendt
- Use cudf::make_empty_column instead of column_view constructor (#14030) @davidwendt
- Remove quadratic runtime due to accessing Frame._dtypes in loop (#14028) @wence-
- Explicitly depend on zlib in conda recipes (#14018) @wence-
- Use grid_stride for stride computations. (#13996) @bdice
- Fix an issue where casting null-array to `object` dtype will result in a failure (#13994) @galipremsagar
- Add tab as literal to cudf::test::to_string output (#13993) @davidwendt
- Enable `codes` dtype parity in pandas-compatibility mode for `factorize` API (#13982) @galipremsagar
- Fix `CategoricalIndex` ordering in `Groupby.agg` when pandas-compatibility mode is enabled (#13978) @galipremsagar
- Produce a fatal error if cudf is unable to find pyarrow include directory (#13976) @cwharris
- Use `thread_index_type` in `partitioning.cu` (#13973) @divyegala
- Use `cudf::thread_index_type` in `merge.cu` (#13972) @divyegala
- Use `copy-pr-bot` (#13970) @ajschmidt8
- Use cudf::thread_index_type in strings custom kernels (#13968) @davidwendt
- Add `bytes_per_second` to hash_partition benchmark (#13965) @Blonck
- Added pinned pool reservation API for java (#13964) @revans2
- Simplify wheel build scripts and allow alphas of RAPIDS dependencies (#13963) @vyasr
- Add `bytes_per_second` to copy_if_else benchmark (#13960) @Blonck
- Add pandas compatible output to `Series.unique` (#13959) @galipremsagar
- Add `bytes_per_second` to compiled binaryop benchmark (#13938) @Blonck
- Unpin `dask` and `distributed` for `23.10` development (#13935) @galipremsagar
- Make HostColumnVector.getRefCount public (#13934) @abellina
- Use cuco::static_set in JSON tree algorithm (#13928) @karthikeyann
- Add java API to get size of host memory needed to copy column view (#13919) @revans2
- Use cudf::size_type instead of int32 where appropriate in nvtext functions (#13915) @davidwendt
- Enable hugepage for arrow host allocations (#13914) @madsbk
- Improve performance of nvtext::edit_distance (#13912) @davidwendt
- Ensure cudf internals use pylibcudf in pure Python mode (#13909) @vyasr
- Use `empty()` instead of `size()` where possible (#13908) @vuule
- [JNI] Adds HostColumnVector.EventHandler for spillability checks (#13898) @abellina
- Return `Timestamp` & `Timedelta` for fetching scalars in `DatetimeIndex` & `TimedeltaIndex` (#13896) @galipremsagar
- Allow explicit `shuffle="p2p"` within dask-cudf API (#13893) @rjzamora
- Disable creation of `DatetimeIndex` when `freq` is passed to `cudf.date_range` (#13890) @galipremsagar
- Bring parity with pandas for `datetime` & `timedelta` comparison operations (#13877) @galipremsagar
- Change `NA` to `NaT` for `datetime` and `timedelta` types (#13868) @galipremsagar
- Raise error when `astype(object)` is called in pandas compatibility mode (#13862) @galipremsagar
- Fixes a performance regression in FST (#13850) @elstehle
- Set native handles to null on close in Java wrapper classes (#13818) @jlowe
- Avoid use of CUDF_EXPECTS in libcudf unit tests outside of helper functions with return values (#13812) @vuule
- Update `lists::contains` to experimental row comparator (#13810) @divyegala
- Reduce `lists::contains` dispatches for scalars (#13805) @divyegala
- Long string optimization for string column parsing in JSON reader (#13803) @karthikeyann
- Raise NotImplementedError for pd.SparseDtype (#13798) @mroeschke
- Remove the libcudf cudf::offset_type type (#13788) @davidwendt
- Move Spark-indpendent Table debug to cudf Java (#13783) @gerashegalov
- Update to Cython 3.0.0 (#13777) @vyasr
- Refactor Parquet reader handling of V2 page header info (#13775) @etseidl
- Branch 23.10 merge 23.08 (#13773) @vyasr
- Restructure JSON code to correctly reflect legacy/experimental status (#13757) @vuule
- Branch 23.10 merge 23.08 (#13753) @vyasr
- Enforce deprecations in `23.10` (#13732) @galipremsagar
- Upgrade to arrow 12 (#13728) @galipremsagar
- Refactors JSON reader's pushdown automaton (#13716) @elstehle
- Remove Arrow dependency from the `datasource.hpp` public header (#13698) @vuule
- C++
Published by rapids-bot[bot] over 2 years ago
https://github.com/rapidsai/cudf - v23.10.00
🚨 Breaking Changes
- Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
- Raise `MixedTypeError` when a column of mixed-dtype is being constructed (#14050) @galipremsagar
- Raise `NotImplementedError` for `MultiIndex.to_series` (#14049) @galipremsagar
- Create table_input_metadata from a table_metadata (#13920) @etseidl
- Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
- Change `NA` to `NaT` for `datetime` and `timedelta` types (#13868) @galipremsagar
- Fix `any`, `all` reduction behavior for `axis=None` and warn for other reductions (#13831) @galipremsagar
- Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
- Remove the libcudf cudf::offset_type type (#13788) @davidwendt
- Raise error when trying to join `datetime` and `timedelta` types with other types (#13786) @galipremsagar
- Update to Cython 3.0.0 (#13777) @vyasr
- Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
- Enforce deprecations in `23.10` (#13732) @galipremsagar
- Upgrade to arrow 12 (#13728) @galipremsagar
- Remove Arrow dependency from the `datasource.hpp` public header (#13698) @vuule
🐛 Bug Fixes
- Fix inaccurate ceil/floor and inaccurate rescaling casts of fixed-point values. (#14242) @bdice
- Fix inaccuracy in decimal128 rounding. (#14233) @bdice
- Workaround for illegal instruction error in sm90 for warp instrinsics with mask (#14201) @karthikeyann
- Fix pytorch related pytest (#14198) @galipremsagar
- Pin to `aws-sdk-cpp<1.11` (#14173) @pentschev
- Fix assert failure for range window functions (#14168) @mythrocks
- Fix Memcheck error found in JSON_TEST JsonReaderTest.ErrorStrings (#14164) @karthikeyann
- Fix calls to copy_bitmask to pass stream parameter (#14158) @davidwendt
- Fix DataFrame from Series with different CategoricalIndexes (#14157) @mroeschke
- Pin to numpy<1.25 and numba<0.58 to avoid errors and deprecation warnings-as-errors. (#14156) @bdice
- Fix kernel launch error for cudf::io::orc::gpu::rowgroup_char_counts_kernel (#14139) @davidwendt
- Don't sort columns for DataFrame init from list of Series (#14136) @mroeschke
- Fix DataFrame.values with no columns but index (#14134) @mroeschke
- Avoid circular cimports in _lib/cpp/reduce.pxd (#14125) @vyasr
- Add support for nested dict in `DataFrame` constructor (#14119) @galipremsagar
- Restrict iterables of `DataFrame`'s as input to `DataFrame` constructor (#14118) @galipremsagar
- Allow `numeric_only=True` for reduction operations on numeric types (#14111) @galipremsagar
- Preserve name of the column while initializing a `DataFrame` (#14110) @galipremsagar
- Correct numerous 20054-D: dynamic initialization errors found on arm+12.2 (#14108) @robertmaynard
- Drop `kwargs` from `Series.count` (#14106) @galipremsagar
- Fix naming issues with `Index.to_frame` and `MultiIndex.to_frame` APIs (#14105) @galipremsagar
- Only use memory resources that haven't been freed (#14103) @robertmaynard
- Add support for `__round__` in `Series` and `DataFrame` (#14099) @galipremsagar
- Validate ignore_index type in drop_duplicates (#14098) @mroeschke
- Fix renaming `Series` and `Index` (#14080) @galipremsagar
- Raise NotImplementedError in to_datetime if Z (or tz component) in string (#14074) @mroeschke
- Raise NotImplementedError for datetime strings with UTC offset (#14070) @mroeschke
- Update pyarrow-related dispatch logic in dask_cudf (#14069) @rjzamora
- Use `conda mambabuild` rather than `mamba mambabuild` (#14067) @wence-
- Raise NotImplementedError in to_datetime with dayfirst without infer_format (#14058) @mroeschke
- Fix various issues in `Index.intersection` (#14054) @galipremsagar
- Fix `Index.difference` to match with pandas (#14053) @galipremsagar
- Fix empty string column construction (#14052) @galipremsagar
- Fix `IntervalIndex.union` to preserve type-metadata (#14051) @galipremsagar
- Raise `MixedTypeError` when a column of mixed-dtype is being constructed (#14050) @galipremsagar
- Raise `NotImplementedError` for `MultiIndex.to_series` (#14049) @galipremsagar
- Ignore compile_commands.json (#14048) @harrism
- Raise TypeError for any non-parseable argument in to_datetime (#14044) @mroeschke
- Raise NotImplementedError for to_datetime with z format (#14037) @mroeschke
- Implement `sort_remaining` for `sort_index` (#14033) @wence-
- Raise NotImplementedError for Categoricals with timezones (#14032) @mroeschke
- Temporary fix Parquet metadata with empty value string being ignored from writing (#14026) @ttnghia
- Preserve types of scalar being returned when possible in `quantile` (#14014) @galipremsagar
- Fix return type of `MultiIndex.difference` (#14009) @galipremsagar
- Raise an error when timezone subtypes are encountered in `pd.IntervalDtype` (#14006) @galipremsagar
- Fix map column can not be non-nullable for java (#14003) @res-life
- Fix `name` selection in `Index.difference` and `Index.intersection` (#13986) @galipremsagar
- Restore column type metadata with `dropna` to fix `factorize` API (#13980) @galipremsagar
- Use thread_index_type to avoid out of bounds accesses in conditional joins (#13971) @vyasr
- Fix `MultiIndex.to_numpy` to return numpy array with tuples (#13966) @galipremsagar
- Use cudf::thread_index_type in get_json_object and tdigest kernels (#13962) @nvdbaranec
- Fix an issue with `IntervalIndex.repr` when null values are present (#13958) @galipremsagar
- Fix type metadata issue preservation with `Column.unique` (#13957) @galipremsagar
- Handle `Interval` scalars when passed in list-like inputs to `cudf.Index` (#13956) @galipremsagar
- Fix setting of categories order when `dtype` is passed to a `CategoricalColumn` (#13955) @galipremsagar
- Handle `as_index` in `GroupBy.apply` (#13951) @brandon-b-miller
- Raise error for string types in `nsmallest` and `nlargest` (#13946) @galipremsagar
- Fix `index` of `Groupby.apply` results when it is performed on empty objects (#13944) @galipremsagar
- Fix integer overflow in shim `device_sum` functions (#13943) @brandon-b-miller
- Fix type mismatch in groupby reduction for empty objects (#13942) @galipremsagar
- Fixed processed bytes calculation in APPLY_BOOLEAN_MASK benchmark. (#13937) @Blonck
- Fix construction of `Grouping` objects (#13932) @galipremsagar
- Fix an issue with `loc` when column names is `MultiIndex` (#13929) @galipremsagar
- Fix handling of typecasting in `searchsorted` (#13925) @galipremsagar
- Preserve index `name` in `reindex` (#13917) @galipremsagar
- Use `cudf::thread_index_type` in cuIO to prevent overflow in row indexing (#13910) @vuule
- Fix for encodings listed in the Parquet column chunk metadata (#13907) @etseidl
- Use cudf::thread_index_type in concatenate.cu. (#13906) @bdice
- Use cudf::thread_index_type in replace.cu. (#13905) @bdice
- Add noSanitizer tag to Java reduction tests failing with sanitizer in CUDA 12 (#13904) @jlowe
- Remove the internal use of the cudf's default stream in cuIO (#13903) @vuule
- Use cuda-nvtx-dev CUDA 12 package. (#13901) @bdice
- Use `thread_index_type` to avoid index overflow in grid-stride loops (#13895) @PointKernel
- Fix memory access error in cudf::shift for sliced strings (#13894) @davidwendt
- Raise error when trying to construct a `DataFrame` with mixed types (#13889) @galipremsagar
- Return `nan` when one variable to be correlated has zero variance in JIT GroupBy Apply (#13884) @brandon-b-miller
- Correctly detect the BOM mark in `read_csv` with compressed input (#13881) @vuule
- Check for the presence of all values in `MultiIndex.isin` (#13879) @galipremsagar
- Fix nvtext::generate_character_ngrams performance regression for longer strings (#13874) @davidwendt
- Fix return type of `MultiIndex.levels` (#13870) @galipremsagar
- Fix List's missing children metadata in JSON writer (#13869) @karthikeyann
- Disable construction of Index when `freq` is set in pandas-compatibility mode (#13857) @galipremsagar
- Fix an issue with fetching `NA` from a `TimedeltaColumn` (#13853) @galipremsagar
- Simplify implementation of interval_range() and fix behaviour for floating `freq` (#13844) @shwina
- Fix binary operations between `Series` and `Index` (#13842) @galipremsagar
- Update make_lists_column_from_scalar to use make_offsets_child_column utility (#13841) @davidwendt
- Fix read out of bounds in string concatenate (#13838) @pentschev
- Raise error for more cases when `timezone-aware` data is passed to `as_column` (#13835) @galipremsagar
- Fix `any`, `all` reduction behavior for `axis=None` and warn for other reductions (#13831) @galipremsagar
- Raise error when trying to construct time-zone aware timestamps (#13830) @galipremsagar
- Fix cuFile I/O factories (#13829) @vuule
- DataFrame with namedtuples uses ._field as column names (#13824) @mroeschke
- Branch 23.10 merge 23.08 (#13822) @vyasr
- Return a Series from JIT GroupBy apply, rather than a DataFrame (#13820) @brandon-b-miller
- No need to dlsym EnsureS3Finalized we can call it directly (#13819) @robertmaynard
- Raise error when mixed types are being constructed (#13816) @galipremsagar
- Fix unbounded sequence issue in `DataFrame` constructor (#13811) @galipremsagar
- Fix Byte-Pair-Encoding usage of cuco static-map for storing merge-pairs (#13807) @davidwendt
- Fix for Parquet writer when requested pages per row is smaller than fragment size (#13806) @etseidl
- Remove hangs from trying to construct un-bounded sequences (#13799) @galipremsagar
- Bug/update libcudf to handle arrow12 changes (#13794) @robertmaynard
- Update get_arrow to arrows 12 CMake target name of arrow::xsimd (#13790) @robertmaynard
- Raise error when trying to join `datetime` and `timedelta` types with other types (#13786) @galipremsagar
- Fix negative unary operation for boolean type (#13780) @galipremsagar
- Fix __contains__(`in`) method for `Series` (#13779) @galipremsagar
- Fix binary operation column ordering and missing column issues (#13778) @galipremsagar
- Cast only time of day to nanos to avoid an overflow in Parquet INT96 write (#13776) @gerashegalov
- Preserve names of column object in various APIs (#13772) @galipremsagar
- Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
- Fix construction of DataFrames from dict when columns are provided (#13766) @wence-
- Provide our own Cython declaration for make_unique (#13746) @wence-
📖 Documentation
- Fix typo in docstring: metadata. (#14025) @bdice
- Fix typo in parquet/page_decode.cuh (#13849) @XinyuZeng
- Simplify Python doc configuration (#13826) @vyasr
- Update documentation to reflect recent changes in JSON reader and writer (#13791) @vuule
- Fix all warnings in Python docs (#13789) @vyasr
🚀 New Features
- [Java] Add JNI bindings for `integers_to_hex` (#14205) @razajafri
- Propagate errors from Parquet reader kernels back to host (#14167) @vuule
- JNI for `HISTOGRAM` and `MERGE_HISTOGRAM` aggregations (#14154) @ttnghia
- Expose streams in all public sorting APIs (#14146) @vyasr
- Enable direct ingestion and production of Arrow scalars (#14121) @vyasr
- Implement `GroupBy.value_counts` to match pandas API (#14114) @stmio
- Refactor parquet thrift reader (#14097) @etseidl
- Refactor `hash_reduce_by_row` (#14095) @ttnghia
- Support negative preceding/following for ROW window functions (#14093) @mythrocks
- Support for progressive parquet chunked reading. (#14079) @nvdbaranec
- Implement `HISTOGRAM` and `MERGE_HISTOGRAM` aggregations (#14045) @ttnghia
- Expose streams in public search APIs (#14034) @vyasr
- Expose streams in public replace APIs (#14010) @vyasr
- Add stream parameter to public cudf::strings::split APIs (#13997) @davidwendt
- Expose streams in public filling APIs (#13990) @vyasr
- Expose streams in public concatenate APIs (#13987) @vyasr
- Use HostMemoryAllocator in jni::allocate_host_buffer (#13975) @gerashegalov
- Enable fractional null probability for hashing benchmark (#13967) @Blonck
- Switch pylibcudf-enabled types to use enum class in Cython (#13931) @vyasr
- Add nvtext::tokenize_with_vocabulary API (#13930) @davidwendt
- Rewrite `DataFrame.stack` to support multi level column names (#13927) @isVoid
- Add HostMemoryAllocator interface (#13924) @gerashegalov
- Global stream pool (#13922) @etseidl
- Create table_input_metadata from a table_metadata (#13920) @etseidl
- Translate column size overflow exception to JNI (#13911) @mythrocks
- Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
- Exclude some tests from running with the compute sanitizer (#13872) @firestarman
- Expand statistics support in ORC writer (#13848) @vuule
- Register the memory mapped buffer in `datasource` to improve H2D throughput (#13814) @vuule
- Add cudf::strings::find function with target per row (#13808) @davidwendt
- Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
- Remove unnecessary pointer copying in JIT GroupBy Apply (#13792) @brandon-b-miller
- Add 'poll' function to custreamz kafka consumer (#13782) @jdye64
- Support `corr` in `GroupBy.apply` through the jit engine (#13767) @shwina
- Optionally write version 2 page headers in Parquet writer (#13751) @etseidl
- Support more numeric types in `Groupby.apply` with `engine='jit'` (#13729) @brandon-b-miller
- [FEA] Add DELTA_BINARY_PACKED decoding support to Parquet reader (#13637) @etseidl
- Read FIXED_LEN_BYTE_ARRAY as binary in parquet reader (#13437) @PointKernel
🛠️ Improvements
- Pin `dask` and `distributed` for `23.10` release (#14225) @galipremsagar
- update rmm tag path (#14195) @AyodeAwe
- Disable `Recently Updated` Check (#14193) @ajschmidt8
- Move cpp/src/hash/hash_allocator.cuh to include/cudf/hashing/detail (#14163) @davidwendt
- Add Parquet reader benchmarks for row selection (#14147) @vuule
- Update image names (#14145) @AyodeAwe
- Support callables in DataFrame.assign (#14142) @wence-
- Reduce memory usage of as_categorical_column (#14138) @wence-
- Replace Python scalar conversions with libcudf (#14124) @vyasr
- Update to clang 16.0.6. (#14120) @bdice
- Fix type of empty `Index` and raise warning in `Series` constructor (#14116) @galipremsagar
- Add stream parameter to external dict APIs (#14115) @SurajAralihalli
- Add fallback matrix for nvcomp. (#14082) @bdice
- [Java] Add recoverWithNull to JSONOptions and pass to Table.readJSON (#14078) @andygrove
- Remove header tests (#14072) @ajschmidt8
- Refactor `contains_table` with cuco::static_set (#14064) @PointKernel
- Remove debug print in a Parquet test (#14063) @vuule
- Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
- Expose stream parameter in public strings find APIs (#14060) @davidwendt
- Update doxygen to 1.9.1 (#14059) @vyasr
- Remove the mr from the base fixture (#14057) @vyasr
- Expose streams in public strings case APIs (#14056) @davidwendt
- Refactor libcudf indexalator to typed normalator (#14043) @davidwendt
- Use cudf::make_empty_column instead of column_view constructor (#14030) @davidwendt
- Remove quadratic runtime due to accessing Frame._dtypes in loop (#14028) @wence-
- Explicitly depend on zlib in conda recipes (#14018) @wence-
- Use grid_stride for stride computations. (#13996) @bdice
- Fix an issue where casting null-array to `object` dtype will result in a failure (#13994) @galipremsagar
- Add tab as literal to cudf::test::to_string output (#13993) @davidwendt
- Enable `codes` dtype parity in pandas-compatibility mode for `factorize` API (#13982) @galipremsagar
- Fix `CategoricalIndex` ordering in `Groupby.agg` when pandas-compatibility mode is enabled (#13978) @galipremsagar
- Produce a fatal error if cudf is unable to find pyarrow include directory (#13976) @cwharris
- Use `thread_index_type` in `partitioning.cu` (#13973) @divyegala
- Use `cudf::thread_index_type` in `merge.cu` (#13972) @divyegala
- Use `copy-pr-bot` (#13970) @ajschmidt8
- Use cudf::thread_index_type in strings custom kernels (#13968) @davidwendt
- Add `bytes_per_second` to hash_partition benchmark (#13965) @Blonck
- Added pinned pool reservation API for java (#13964) @revans2
- Simplify wheel build scripts and allow alphas of RAPIDS dependencies (#13963) @vyasr
- Add `bytes_per_second` to copy_if_else benchmark (#13960) @Blonck
- Add pandas compatible output to `Series.unique` (#13959) @galipremsagar
- Add `bytes_per_second` to compiled binaryop benchmark (#13938) @Blonck
- Unpin `dask` and `distributed` for `23.10` development (#13935) @galipremsagar
- Make HostColumnVector.getRefCount public (#13934) @abellina
- Use cuco::static_set in JSON tree algorithm (#13928) @karthikeyann
- Add java API to get size of host memory needed to copy column view (#13919) @revans2
- Use cudf::size_type instead of int32 where appropriate in nvtext functions (#13915) @davidwendt
- Enable hugepage for arrow host allocations (#13914) @madsbk
- Improve performance of nvtext::edit_distance (#13912) @davidwendt
- Ensure cudf internals use pylibcudf in pure Python mode (#13909) @vyasr
- Use `empty()` instead of `size()` where possible (#13908) @vuule
- [JNI] Adds HostColumnVector.EventHandler for spillability checks (#13898) @abellina
- Return `Timestamp` & `Timedelta` for fetching scalars in `DatetimeIndex` & `TimedeltaIndex` (#13896) @galipremsagar
- Allow explicit `shuffle="p2p"` within dask-cudf API (#13893) @rjzamora
- Disable creation of `DatetimeIndex` when `freq` is passed to `cudf.date_range` (#13890) @galipremsagar
- Bring parity with pandas for `datetime` & `timedelta` comparison operations (#13877) @galipremsagar
- Change `NA` to `NaT` for `datetime` and `timedelta` types (#13868) @galipremsagar
- Raise error when `astype(object)` is called in pandas compatibility mode (#13862) @galipremsagar
- Fixes a performance regression in FST (#13850) @elstehle
- Set native handles to null on close in Java wrapper classes (#13818) @jlowe
- Avoid use of CUDF_EXPECTS in libcudf unit tests outside of helper functions with return values (#13812) @vuule
- Update `lists::contains` to experimental row comparator (#13810) @divyegala
- Reduce `lists::contains` dispatches for scalars (#13805) @divyegala
- Long string optimization for string column parsing in JSON reader (#13803) @karthikeyann
- Raise NotImplementedError for pd.SparseDtype (#13798) @mroeschke
- Remove the libcudf cudf::offset_type type (#13788) @davidwendt
- Move Spark-independent Table debug to cudf Java (#13783) @gerashegalov
- Update to Cython 3.0.0 (#13777) @vyasr
- Refactor Parquet reader handling of V2 page header info (#13775) @etseidl
- Branch 23.10 merge 23.08 (#13773) @vyasr
- Restructure JSON code to correctly reflect legacy/experimental status (#13757) @vuule
- Branch 23.10 merge 23.08 (#13753) @vyasr
- Enforce deprecations in `23.10` (#13732) @galipremsagar
- Upgrade to arrow 12 (#13728) @galipremsagar
- Refactors JSON reader's pushdown automaton (#13716) @elstehle
- Remove Arrow dependency from the `datasource.hpp` public header (#13698) @vuule
- C++
Published by raydouglass over 2 years ago
https://github.com/rapidsai/cudf - [NIGHTLY] v23.12.00
🚨 Breaking Changes
- Expose stream parameter in public strings convert APIs (#14255) @davidwendt
🐛 Bug Fixes
- Downgrade to Arrow 12.0.0 for aws-sdk-cpp and fix cudf_kafka builds for new CI containers (#14296) @vyasr
- Fix memset error in nvtext::edit_distance_matrix (#14283) @davidwendt
- Changes JSON reader's recovery option's behaviour to ignore all characters after a valid JSON record (#14279) @elstehle
- Handle empty string correctly in Parquet statistics (#14257) @etseidl
- Fixes behaviour for incomplete lines when `recover_with_nulls` is enabled (#14252) @elstehle
- cudf::detail::pinned_allocator doesn't throw from `deallocate` (#14251) @robertmaynard
- Fix strings replace for adjacent, identical multi-byte UTF-8 character targets (#14235) @davidwendt
- Fixing parquet list of struct interpretation (#13715) @hyperbolic2346
🚀 New Features
- Update rapids-cmake functions to non-deprecated signatures (#14265) @robertmaynard
- Expose streams in public null mask APIs (#14263) @vyasr
- Expose streams in binaryop APIs (#14187) @vyasr
- Add pylibcudf.Scalar that interoperates with Arrow scalars (#14133) @vyasr
- Add DELTA_BINARY_PACKED encoder for Parquet writer (#14100) @etseidl
🛠️ Improvements
- Forward-merge branch-23.10 to branch-23.12 (#14307) @bdice
- Update `shared-action-workflows` references (#14289) @AyodeAwe
- Add stream parameter to List Sort and Filter APIs (#14272) @SurajAralihalli
- Use branch-23.12 workflows. (#14271) @bdice
- Refactor LogicalType for Parquet (#14264) @etseidl
- Centralize chunked reading code in the parquet reader to reader_impl_chunking.cu (#14262) @nvdbaranec
- Expose stream parameter in public strings replace APIs (#14261) @davidwendt
- Expose stream parameter in public strings APIs (#14260) @davidwendt
- Cleanup of namespaces in parquet code. (#14259) @nvdbaranec
- Make parquet schema index type consistent (#14256) @hyperbolic2346
- Expose stream parameter in public strings convert APIs (#14255) @davidwendt
- Add in java bindings for DataSource (#14254) @revans2
- Add stream parameter to List Manipulation and Operations APIs (#14248) @SurajAralihalli
- Expose stream parameter in public strings split/partition APIs (#14247) @davidwendt
- Improve `contains_column` by invoking `contains_table` (#14238) @PointKernel
- Detect and report errors in Parquet header parsing (#14237) @etseidl
- Forward merge `23.10` into `23.12` (#14231) @galipremsagar
- Return error if BOOL8 column-type is used with integers-to-hex (#14208) @davidwendt
- Enable indexalator for device code (#14206) @davidwendt
- Marginally reduce memory footprint of joins (#14197) @wence-
- Add nvtx annotations to spilling-based data movement (#14196) @wence-
- Remove the use of volatile in ORC (#14175) @vuule
- Add `bytes_per_second` to distinct_count of stream_compaction nvbench. (#14172) @Blonck
- Add `bytes_per_second` to transpose benchmark (#14170) @Blonck
- cuDF: Build CUDA 12.0 ARM conda packages. (#14112) @bdice
- Add `bytes_per_second` to shift benchmark (#13950) @Blonck
- C++
Published by rapids-bot[bot] over 2 years ago
https://github.com/rapidsai/cudf - v23.08.00
🚨 Breaking Changes
- Enforce deprecations and add clarifications around existing deprecations (#13710) @galipremsagar
- Separate MurmurHash32 from hash_functions.cuh (#13681) @davidwendt
- Avoid storing metadata in pointers in ORC and Parquet writers (#13648) @vuule
- Expose streams in all public copying APIs (#13629) @vyasr
- Remove deprecated cudf::strings::slice_strings (by delimiter) functions (#13628) @davidwendt
- Remove deprecated cudf.set_allocator. (#13591) @bdice
- Change build.sh to use pip install instead of setup.py (#13507) @vyasr
- Remove unused max_rows_tensor parameter from subword tokenizer (#13463) @davidwendt
- Fix decimal scale reductions in `_get_decimal_type` (#13224) @charlesbluca
🐛 Bug Fixes
- Add CUDA version to cudf_kafka and libcudf-example build strings. (#13769) @bdice
- Fix typo in wheels-test.yaml. (#13763) @bdice
- Don't test strings shorter than the requested ngram size (#13758) @vyasr
- Add CUDA version to custreamz build string. (#13754) @bdice
- Fix writing of ORC files with empty child string columns (#13745) @vuule
- Remove the erroneous "empty level" short-circuit from ORC reader (#13722) @vuule
- Fix character counting when writing sliced tables into ORC (#13721) @vuule
- Parquet uses row group row count if missing from header (#13712) @hyperbolic2346
- Fix reading of RLE encoded boolean data from parquet files with V2 page headers (#13707) @etseidl
- Fix a corner case of list lexicographic comparator (#13701) @ttnghia
- Fix combined filtering and column projection in `dask_cudf.read_parquet` (#13697) @rjzamora
- Revert fetch-rapids changes (#13696) @vyasr
- Data generator - include offsets in the size estimate of list elements (#13688) @vuule
- Add `cuda-nvcc-impl` to `cudf` for `numba` CUDA 12 (#13673) @jakirkham
- Fix combined filtering and column projection in `read_parquet` (#13666) @rjzamora
- Use `thrust::identity` as hash functions for byte pair encoding (#13665) @PointKernel
- Fix loc-getitem ordering when index contains duplicate labels (#13659) @wence-
- [REVIEW] Introduce parity with pandas for `MultiIndex.loc` ordering & fix a bug in `Groupby` with `as_index` (#13657) @galipremsagar
- Fix memcheck error found in nvtext tokenize functions (#13649) @davidwendt
- Fix `has_nonempty_nulls` ignoring column offset (#13647) @ttnghia
- [Java] Avoid double-free corruption in case of an Exception while creating a ColumnView (#13645) @razajafri
- Fix memcheck error in ORC reader call to cudf::io::copy_uncompressed_kernel (#13643) @davidwendt
- Fix CUDA 12 conda environment to remove cubinlinker and ptxcompiler. (#13636) @bdice
- Fix inf/NaN comparisons for FLOAT orderby in window functions (#13635) @mythrocks
- Refactor `Index` search to simplify code and increase correctness (#13625) @wence-
- Fix compile warning for unused variable in split_re.cu (#13621) @davidwendt
- Fix tz_localize for dask_cudf Series (#13610) @shwina
- Fix issue with no decompressed data in ORC reader (#13609) @vuule
- Fix floating point window range extents. (#13606) @mythrocks
- Fix `localize(None)` for timezone-naive columns (#13603) @shwina
- Fixed a memory leak caused by Exception thrown while constructing a ColumnView (#13597) @razajafri
- Handle nullptr return value from bitmask_or in distinct_count (#13590) @wence-
- Bring parity with pandas in Index.join (#13589) @galipremsagar
- Fix cudf.melt when there are more than 255 columns (#13588) @hcho3
- Fix memory issues in cuIO due to removal of memory padding (#13586) @ttnghia
- Fix Parquet multi-file reading (#13584) @etseidl
- Fix memcheck error found in LISTS_TEST (#13579) @davidwendt
- Fix memcheck error found in STRINGS_TEST (#13578) @davidwendt
- Fix memcheck error found in INTEROP_TEST (#13577) @davidwendt
- Fix memcheck errors found in REDUCTION_TEST (#13574) @davidwendt
- Preemptive fix for hive-partitioning change in dask (#13564) @rjzamora
- Fix an issue with `dask_cudf.read_csv` when lines are needed to be skipped (#13555) @galipremsagar
- Fix out-of-bounds memory write in cudf::dictionary::detail::concatenate (#13554) @davidwendt
- Fix the null mask size in json reader (#13537) @karthikeyann
- Fix cudf::strings::strip for all-empty input column (#13533) @davidwendt
- Make sure to build without isolation or installing dependencies (#13524) @vyasr
- Remove preload lib from CMake for now (#13519) @vyasr
- Fix missing separator after null values in JSON writer (#13503) @karthikeyann
- Ensure `single_lane_block_sum_reduce` is safe to call in a loop (#13488) @wence-
- Update all versions in pyproject.toml files. (#13486) @bdice
- Remove applying nvbench that doesn't exist in 23.08 (#13484) @robertmaynard
- Fix chunked Parquet reader benchmark (#13482) @vuule
- Update JNI JSON reader column compatibility for Spark (#13477) @revans2
- Fix unsanitized output of scan with strings (#13455) @davidwendt
- Reject functions without bytecode from `_can_be_jitted` in GroupBy Apply (#13429) @brandon-b-miller
- Fix decimal scale reductions in `_get_decimal_type` (#13224) @charlesbluca
📖 Documentation
- Fix doxygen groups for io data sources and sinks (#13718) @davidwendt
- Add pandas compatibility note to DataFrame.query docstring (#13693) @beckernick
- Add pylibcudf to developer guide (#13639) @vyasr
- Fix repeated words in doxygen text (#13598) @karthikeyann
- Update docs for top-level API. (#13592) @bdice
- Fix the the doxygen text for cudf::concatenate and other places (#13561) @davidwendt
- Document stream validation approach used in testing (#13556) @vyasr
- Cleanup doc repetitions in libcudf (#13470) @karthikeyann
🚀 New Features
- Support `min` and `max` aggregations for list type in groupby and reduction (#13676) @ttnghia
- Add nvtext::jaccard_index API for strings columns (#13669) @davidwendt
- Add read_parquet_metadata libcudf API (#13663) @karthikeyann
- Expose streams in all public copying APIs (#13629) @vyasr
- Add XXHash_64 hash function to cudf (#13612) @davidwendt
- Java support: Floating point order-by columns for RANGE window functions (#13595) @mythrocks
- Use `cuco::static_map` to build string dictionaries in ORC writer (#13580) @vuule
- Add pylibcudf subpackage with gather implementation (#13562) @vyasr
- Add JNI for `lists::concatenate_list_elements` (#13547) @ttnghia
- Enable nested types for `lists::concatenate_list_elements` (#13545) @ttnghia
- Add unicode encoding for string columns in JSON writer (#13539) @karthikeyann
- Remove numba kernels from `find_index_of_val` (#13517) @brandon-b-miller
- Floating point order-by columns for RANGE window functions (#13512) @mythrocks
- Parse column chunk metadata statistics in parquet reader (#13472) @karthikeyann
- Add `abs` function to apply (#13408) @brandon-b-miller
- [FEA] AST filtering in parquet reader (#13348) @karthikeyann
- [FEA] Adds option to recover from invalid JSON lines in JSON tokenizer (#13344) @elstehle
- Ensure cccl packages don't clash with upstream version (#13235) @robertmaynard
- Update `struct_minmax_util` to experimental row comparator (#13069) @divyegala
- Add stream parameter to hashing APIs (#12090) @vyasr
🛠️ Improvements
- Pin `dask` and `distributed` for `23.08` release (#13802) @galipremsagar
- Relax protobuf pinnings. (#13770) @bdice
- Switch fully unbounded window functions to use aggregations (#13727) @mythrocks
- Switch to new wheel building pipeline (#13723) @vyasr
- Revert CUDA 12.0 CI workflows to branch-23.08. (#13719) @bdice
- Adding identify minimum version requirement (#13713) @hyperbolic2346
- Enforce deprecations and add clarifications around existing deprecations (#13710) @galipremsagar
- Optimize ORC reader performance for list data (#13708) @vyasr
- fix limit overflow message in a docstring (#13703) @ahmet-uyar
- Alleviates JSON parser's need for multi-file sources to end with a newline (#13702) @elstehle
- Update cython-lint and replace flake8 with ruff (#13699) @vyasr
- Add `__dask_tokenize__` definitions to cudf classes (#13695) @rjzamora
- Convert libcudf hashing benchmarks to nvbench (#13694) @davidwendt
- Separate MurmurHash32 from hash_functions.cuh (#13681) @davidwendt
- Improve performance of cudf::strings::split on whitespace (#13680) @davidwendt
- Allow ORC and Parquet writers to write nullable columns without nulls as non-nullable (#13675) @vuule
- Raise a NotImplementedError in to_datetime when utc is passed (#13670) @shwina
- Add rmm_mode parameter to nvbench base fixture (#13668) @davidwendt
- Fix multiindex loc ordering in pandas-compat mode (#13660) @wence-
- Add nvtext hash_character_ngrams function (#13654) @davidwendt
- Avoid storing metadata in pointers in ORC and Parquet writers (#13648) @vuule
- Acquire spill lock in to/from_arrow (#13646) @shwina
- Expose stable versions of libcudf sort routines (#13634) @wence-
- Separate out hash_test.cpp source for each hash API (#13633) @davidwendt
- Remove deprecated cudf::strings::slice_strings (by delimiter) functions (#13628) @davidwendt
- Create separate libcudf hash APIs for each supported hash function (#13626) @davidwendt
- Add convert_dtypes API (#13623) @shwina
- Clean up cupy in dependencies.yaml. (#13617) @bdice
- Use cuda-version to constrain cudatoolkit. (#13615) @bdice
- Add murmurhash3_x64_128 function to libcudf (#13604) @davidwendt
- Performance improvement for cudf::strings::like (#13594) @davidwendt
- Remove deprecated cudf.set_allocator. (#13591) @bdice
- Clean up cudf device atomic with `cuda::atomic_ref` (#13583) @PointKernel
- Add java bindings for distinct count (#13573) @revans2
- Use nvcomp conda package. (#13566) @bdice
- Add exception to string_scalar if input string exceeds size_type (#13560) @davidwendt
- Add dispatch for `cudf.Dataframe` to/from `pyarrow.Table` conversion (#13558) @rjzamora
- Get rid of `cuco::pair_type` aliases (#13553) @PointKernel
- Introduce parity with pandas when `sort=False` in `Groupby` (#13551) @galipremsagar
- Update CMake in docker to 3.26.4 (#13550) @NvTimLiu
- Clarify source of error message in stream testing. (#13541) @bdice
- Deprecate `strings_to_categorical` in `cudf.read_parquet` (#13540) @galipremsagar
- Update to CMake 3.26.4 (#13538) @vyasr
- s3 folder naming fix (#13536) @AyodeAwe
- Implement iloc-getitem using parse-don't-validate approach (#13534) @wence-
- Make synchronization explicit in the names of `hostdevice_*` copying APIs (#13530) @ttnghia
- Add benchmark (Google Benchmark) dependency to conda packages. (#13528) @bdice
- Add libcufile to dependencies.yaml. (#13523) @bdice
- Fix some memoization logic in groupby/sort/sort_helper.cu (#13521) @davidwendt
- Use sizes_to_offsets_iterator in cudf::gather for strings (#13520) @davidwendt
- use rapids-upload-docs script (#13518) @AyodeAwe
- Support UTF-8 BOM in CSV reader (#13516) @davidwendt
- Move stream-related test configuration to CMake (#13513) @vyasr
- Implement `cudf.option_context` (#13511) @galipremsagar
- Unpin `dask` and `distributed` for development (#13508) @galipremsagar
- Change build.sh to use pip install instead of setup.py (#13507) @vyasr
- Use test default stream (#13506) @vyasr
- Remove documentation build scripts for Jenkins (#13495) @ajschmidt8
- Use east const in include files (#13494) @karthikeyann
- Use east const in src files (#13493) @karthikeyann
- Use east const in tests files (#13492) @karthikeyann
- Use east const in benchmarks files (#13491) @karthikeyann
- Performance improvement for nvtext tokenize/token functions (#13480) @davidwendt
- Add pd.Float*Dtype to Avro and ORC mappings (#13475) @mroeschke
- Use pandas public APIs where available (#13467) @mroeschke
- Allow pd.ArrowDtype in cudf.from_pandas (#13465) @mroeschke
- Rework libcudf regex benchmarks with nvbench (#13464) @davidwendt
- Remove unused max_rows_tensor parameter from subword tokenizer (#13463) @davidwendt
- Separate io-text and nvtext pytests into different files (#13435) @davidwendt
- Add a move_to function to cudf::string_view::const_iterator (#13428) @davidwendt
- Allow newer scikit-build (#13424) @vyasr
- Refactor sort_by_values to sort_values, drop indices from return values. (#13419) @bdice
- Inline Cython exception handler (#13411) @vyasr
- Init JNI version 23.08.0-SNAPSHOT (#13401) @pxLi
- Refactor ORC reader (#13396) @ttnghia
- JNI: Remove cleaned objects in memory cleaner (#13378) @res-life
- Add tests of currently unsupported indexing (#13338) @wence-
- Performance improvement for some libcudf regex functions for long strings (#13322) @davidwendt
- Exposure Tracked Buffer (first step towards unifying copy-on-write and spilling) (#13307) @madsbk
- Write string data directly to column_buffer in Parquet reader (#13302) @etseidl
- Add stacktrace into cudf exception types (#13298) @ttnghia
- cuDF: Build CUDA 12 packages (#12922) @bdice
- C++
Published by raydouglass over 2 years ago
https://github.com/rapidsai/cudf - [NIGHTLY] v23.10.00
🚨 Breaking Changes
- Raise `MixedTypeError` when a column of mixed-dtype is being constructed (#14050) @galipremsagar
- Raise `NotImplementedError` for `MultiIndex.to_series` (#14049) @galipremsagar
- Create table_input_metadata from a table_metadata (#13920) @etseidl
- Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
- Change `NA` to `NaT` for `datetime` and `timedelta` types (#13868) @galipremsagar
- Fix `any`, `all` reduction behavior for `axis=None` and warn for other reductions (#13831) @galipremsagar
- Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
- Remove the libcudf cudf::offset_type type (#13788) @davidwendt
- Raise error when trying to join `datetime` and `timedelta` types with other types (#13786) @galipremsagar
- Update to Cython 3.0.0 (#13777) @vyasr
- Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
- Enforce deprecations in `23.10` (#13732) @galipremsagar
- Upgrade to arrow 12 (#13728) @galipremsagar
- Remove Arrow dependency from the `datasource.hpp` public header (#13698) @vuule
🐛 Bug Fixes
- Fix kernel launch error for cudf::io::orc::gpu::rowgroup_char_counts_kernel (#14139) @davidwendt
- Don't sort columns for DataFrame init from list of Series (#14136) @mroeschke
- Fix DataFrame.values with no columns but index (#14134) @mroeschke
- Avoid circular cimports in _lib/cpp/reduce.pxd (#14125) @vyasr
- Add support for nested dict in `DataFrame` constructor (#14119) @galipremsagar
- Restrict iterables of `DataFrame`'s as input to `DataFrame` constructor (#14118) @galipremsagar
- Allow `numeric_only=True` for reduction operations on numeric types (#14111) @galipremsagar
- Drop `kwargs` from `Series.count` (#14106) @galipremsagar
- Fix naming issues with `Index.to_frame` and `MultiIndex.to_frame` APIs (#14105) @galipremsagar
- Only use memory resources that haven't been freed (#14103) @robertmaynard
- Add support for `__round__` in `Series` and `DataFrame` (#14099) @galipremsagar
- Validate ignore_index type in drop_duplicates (#14098) @mroeschke
- Fix renaming `Series` and `Index` (#14080) @galipremsagar
- Raise NotImplementedError in to_datetime if Z (or tz component) in string (#14074) @mroeschke
- Raise NotImplementedError for datetime strings with UTC offset (#14070) @mroeschke
- Update pyarrow-related dispatch logic in dask_cudf (#14069) @rjzamora
- Use `conda mambabuild` rather than `mamba mambabuild` (#14067) @wence-
- Raise NotImplementedError in to_datetime with dayfirst without infer_format (#14058) @mroeschke
- Fix various issues in `Index.intersection` (#14054) @galipremsagar
- Fix `Index.difference` to match with pandas (#14053) @galipremsagar
- Fix empty string column construction (#14052) @galipremsagar
- Fix `IntervalIndex.union` to preserve type-metadata (#14051) @galipremsagar
- Raise `MixedTypeError` when a column of mixed-dtype is being constructed (#14050) @galipremsagar
- Raise `NotImplementedError` for `MultiIndex.to_series` (#14049) @galipremsagar
- Ignore compile_commands.json (#14048) @harrism
- Raise TypeError for any non-parseable argument in to_datetime (#14044) @mroeschke
- Raise NotImplementedError for to_datetime with z format (#14037) @mroeschke
- Implement `sort_remaining` for `sort_index` (#14033) @wence-
- Raise NotImplementedError for Categoricals with timezones (#14032) @mroeschke
- Temporary fix Parquet metadata with empty value string being ignored from writing (#14026) @ttnghia
- Preserve types of scalar being returned when possible in `quantile` (#14014) @galipremsagar
- Fix return type of `MultiIndex.difference` (#14009) @galipremsagar
- Raise an error when timezone subtypes are encountered in `pd.IntervalDtype` (#14006) @galipremsagar
- Fix map column can not be non-nullable for java (#14003) @res-life
- Fix `name` selection in `Index.difference` and `Index.intersection` (#13986) @galipremsagar
- Restore column type metadata with `dropna` to fix `factorize` API (#13980) @galipremsagar
- Use thread_index_type to avoid out of bounds accesses in conditional joins (#13971) @vyasr
- Fix `MultiIndex.to_numpy` to return numpy array with tuples (#13966) @galipremsagar
- Use cudf::thread_index_type in get_json_object and tdigest kernels (#13962) @nvdbaranec
- Fix an issue with `IntervalIndex.repr` when null values are present (#13958) @galipremsagar
- Fix type metadata issue preservation with `Column.unique` (#13957) @galipremsagar
- Handle `Interval` scalars when passed in list-like inputs to `cudf.Index` (#13956) @galipremsagar
- Fix setting of categories order when `dtype` is passed to a `CategoricalColumn` (#13955) @galipremsagar
- Handle `as_index` in `GroupBy.apply` (#13951) @brandon-b-miller
- Raise error for string types in `nsmallest` and `nlargest` (#13946) @galipremsagar
- Fix `index` of `Groupby.apply` results when it is performed on empty objects (#13944) @galipremsagar
- Fix integer overflow in shim `device_sum` functions (#13943) @brandon-b-miller
- Fix type mismatch in groupby reduction for empty objects (#13942) @galipremsagar
- Fixed processed bytes calculation in APPLY_BOOLEAN_MASK benchmark. (#13937) @Blonck
- Fix construction of `Grouping` objects (#13932) @galipremsagar
- Fix an issue with `loc` when column names is `MultiIndex` (#13929) @galipremsagar
- Fix handling of typecasting in `searchsorted` (#13925) @galipremsagar
- Preserve index `name` in `reindex` (#13917) @galipremsagar
- Use `cudf::thread_index_type` in cuIO to prevent overflow in row indexing (#13910) @vuule
- Fix for encodings listed in the Parquet column chunk metadata (#13907) @etseidl
- Use cudf::thread_index_type in concatenate.cu. (#13906) @bdice
- Use cudf::thread_index_type in replace.cu. (#13905) @bdice
- Add noSanitizer tag to Java reduction tests failing with sanitizer in CUDA 12 (#13904) @jlowe
- Remove the internal use of the cudf's default stream in cuIO (#13903) @vuule
- Use cuda-nvtx-dev CUDA 12 package. (#13901) @bdice
- Use `thread_index_type` to avoid index overflow in grid-stride loops (#13895) @PointKernel
- Fix memory access error in cudf::shift for sliced strings (#13894) @davidwendt
- Raise error when trying to construct a `DataFrame` with mixed types (#13889) @galipremsagar
- Return `nan` when one variable to be correlated has zero variance in JIT GroupBy Apply (#13884) @brandon-b-miller
- Correctly detect the BOM mark in `read_csv` with compressed input (#13881) @vuule
- Check for the presence of all values in `MultiIndex.isin` (#13879) @galipremsagar
- Fix nvtext::generate_character_ngrams performance regression for longer strings (#13874) @davidwendt
- Fix return type of `MultiIndex.levels` (#13870) @galipremsagar
- Fix List's missing children metadata in JSON writer (#13869) @karthikeyann
- Disable construction of Index when `freq` is set in pandas-compatibility mode (#13857) @galipremsagar
- Fix an issue with fetching `NA` from a `TimedeltaColumn` (#13853) @galipremsagar
- Simplify implementation of interval_range() and fix behaviour for floating `freq` (#13844) @shwina
- Fix binary operations between `Series` and `Index` (#13842) @galipremsagar
- Update make_lists_column_from_scalar to use make_offsets_child_column utility (#13841) @davidwendt
- Fix read out of bounds in string concatenate (#13838) @pentschev
- Raise error for more cases when `timezone-aware` data is passed to `as_column` (#13835) @galipremsagar
- Fix `any`, `all` reduction behavior for `axis=None` and warn for other reductions (#13831) @galipremsagar
- Raise error when trying to construct time-zone aware timestamps (#13830) @galipremsagar
- Fix cuFile I/O factories (#13829) @vuule
- DataFrame with namedtuples uses ._field as column names (#13824) @mroeschke
- Branch 23.10 merge 23.08 (#13822) @vyasr
- Return a Series from JIT GroupBy apply, rather than a DataFrame (#13820) @brandon-b-miller
- No need to dlsym EnsureS3Finalized we can call it directly (#13819) @robertmaynard
- Raise error when mixed types are being constructed (#13816) @galipremsagar
- Fix unbounded sequence issue in `DataFrame` constructor (#13811) @galipremsagar
- Fix Byte-Pair-Encoding usage of cuco static-map for storing merge-pairs (#13807) @davidwendt
- Fix for Parquet writer when requested pages per row is smaller than fragment size (#13806) @etseidl
- Remove hangs from trying to construct un-bounded sequences (#13799) @galipremsagar
- Bug/update libcudf to handle arrow12 changes (#13794) @robertmaynard
- Update get_arrow to arrows 12 CMake target name of arrow::xsimd (#13790) @robertmaynard
- Raise error when trying to join `datetime` and `timedelta` types with other types (#13786) @galipremsagar
- Fix negative unary operation for boolean type (#13780) @galipremsagar
- Fix `__contains__` (`in`) method for `Series` (#13779) @galipremsagar
- Fix binary operation column ordering and missing column issues (#13778) @galipremsagar
- Cast only time of day to nanos to avoid an overflow in Parquet INT96 write (#13776) @gerashegalov
- Preserve names of column object in various APIs (#13772) @galipremsagar
- Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
- Fix construction of DataFrames from dict when columns are provided (#13766) @wence-
- Provide our own Cython declaration for make_unique (#13746) @wence-
📖 Documentation
- Fix typo in docstring: metadata. (#14025) @bdice
- Fix typo in parquet/page_decode.cuh (#13849) @XinyuZeng
- Simplify Python doc configuration (#13826) @vyasr
- Update documentation to reflect recent changes in JSON reader and writer (#13791) @vuule
- Fix all warnings in Python docs (#13789) @vyasr
🚀 New Features
- Implement `GroupBy.value_counts` to match pandas API (#14114) @stmio
- Refactor parquet thrift reader (#14097) @etseidl
- Refactor `hash_reduce_by_row` (#14095) @ttnghia
- Support negative preceding/following for ROW window functions (#14093) @mythrocks
- Expose streams in public search APIs (#14034) @vyasr
- Expose streams in public replace APIs (#14010) @vyasr
- Add stream parameter to public cudf::strings::split APIs (#13997) @davidwendt
- Expose streams in public filling APIs (#13990) @vyasr
- Expose streams in public concatenate APIs (#13987) @vyasr
- Use HostMemoryAllocator in jni::allocate_host_buffer (#13975) @gerashegalov
- Enable fractional null probability for hashing benchmark (#13967) @Blonck
- Switch pylibcudf-enabled types to use enum class in Cython (#13931) @vyasr
- Rewrite `DataFrame.stack` to support multi level column names (#13927) @isVoid
- Add HostMemoryAllocator interface (#13924) @gerashegalov
- Global stream pool (#13922) @etseidl
- Create table_input_metadata from a table_metadata (#13920) @etseidl
- Translate column size overflow exception to JNI (#13911) @mythrocks
- Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
- Exclude some tests from running with the compute sanitizer (#13872) @firestarman
- Expand statistics support in ORC writer (#13848) @vuule
- Register the memory mapped buffer in `datasource` to improve H2D throughput (#13814) @vuule
- Add cudf::strings::find function with target per row (#13808) @davidwendt
- Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
- Remove unnecessary pointer copying in JIT GroupBy Apply (#13792) @brandon-b-miller
- Add 'poll' function to custreamz kafka consumer (#13782) @jdye64
- Support `corr` in `GroupBy.apply` through the jit engine (#13767) @shwina
- Optionally write version 2 page headers in Parquet writer (#13751) @etseidl
- Support more numeric types in `Groupby.apply` with `engine='jit'` (#13729) @brandon-b-miller
- [FEA] Add DELTA_BINARY_PACKED decoding support to Parquet reader (#13637) @etseidl
- Read FIXED_LEN_BYTE_ARRAY as binary in parquet reader (#13437) @PointKernel
🛠️ Improvements
- Reduce memory usage of as_categorical_column (#14138) @wence-
- Update to clang 16.0.6. (#14120) @bdice
- Fix type of empty `Index` and raise warning in `Series` constructor (#14116) @galipremsagar
- Add fallback matrix for nvcomp. (#14082) @bdice
- [Java] Add recoverWithNull to JSONOptions and pass to Table.readJSON (#14078) @andygrove
- Remove header tests (#14072) @ajschmidt8
- Remove debug print in a Parquet test (#14063) @vuule
- Expose stream parameter in public strings find APIs (#14060) @davidwendt
- Update doxygen to 1.9.1 (#14059) @vyasr
- Remove the mr from the base fixture (#14057) @vyasr
- Expose streams in public strings case APIs (#14056) @davidwendt
- Use cudf::make_empty_column instead of column_view constructor (#14030) @davidwendt
- Remove quadratic runtime due to accessing Frame._dtypes in loop (#14028) @wence-
- Explicitly depend on zlib in conda recipes (#14018) @wence-
- Use grid_stride for stride computations. (#13996) @bdice
- Fix an issue where casting null-array to `object` dtype will result in a failure (#13994) @galipremsagar
- Add tab as literal to cudf::test::to_string output (#13993) @davidwendt
- Enable `codes` dtype parity in pandas-compatibility mode for `factorize` API (#13982) @galipremsagar
- Fix `CategoricalIndex` ordering in `Groupby.agg` when pandas-compatibility mode is enabled (#13978) @galipremsagar
- Produce a fatal error if cudf is unable to find pyarrow include directory (#13976) @cwharris
- Use `thread_index_type` in `partitioning.cu` (#13973) @divyegala
- Use `cudf::thread_index_type` in `merge.cu` (#13972) @divyegala
- Use `copy-pr-bot` (#13970) @ajschmidt8
- Use cudf::thread_index_type in strings custom kernels (#13968) @davidwendt
- Add `bytes_per_second` to hash_partition benchmark (#13965) @Blonck
- Added pinned pool reservation API for java (#13964) @revans2
- Simplify wheel build scripts and allow alphas of RAPIDS dependencies (#13963) @vyasr
- Add `bytes_per_second` to copy_if_else benchmark (#13960) @Blonck
- Add pandas compatible output to `Series.unique` (#13959) @galipremsagar
- Add `bytes_per_second` to compiled binaryop benchmark (#13938) @Blonck
- Unpin `dask` and `distributed` for `23.10` development (#13935) @galipremsagar
- Make HostColumnVector.getRefCount public (#13934) @abellina
- Use cuco::static_set in JSON tree algorithm (#13928) @karthikeyann
- Add java API to get size of host memory needed to copy column view (#13919) @revans2
- Use cudf::size_type instead of int32 where appropriate in nvtext functions (#13915) @davidwendt
- Enable hugepage for arrow host allocations (#13914) @madsbk
- Improve performance of nvtext::edit_distance (#13912) @davidwendt
- Ensure cudf internals use pylibcudf in pure Python mode (#13909) @vyasr
- Use `empty()` instead of `size()` where possible (#13908) @vuule
- [JNI] Adds HostColumnVector.EventHandler for spillability checks (#13898) @abellina
- Return `Timestamp` & `Timedelta` for fetching scalars in `DatetimeIndex` & `TimedeltaIndex` (#13896) @galipremsagar
- Disable creation of `DatetimeIndex` when `freq` is passed to `cudf.date_range` (#13890) @galipremsagar
- Bring parity with pandas for `datetime` & `timedelta` comparison operations (#13877) @galipremsagar
- Change `NA` to `NaT` for `datetime` and `timedelta` types (#13868) @galipremsagar
- Raise error when `astype(object)` is called in pandas compatibility mode (#13862) @galipremsagar
- Fixes a performance regression in FST (#13850) @elstehle
- Set native handles to null on close in Java wrapper classes (#13818) @jlowe
- Avoid use of CUDF_EXPECTS in libcudf unit tests outside of helper functions with return values (#13812) @vuule
- Update `lists::contains` to experimental row comparator (#13810) @divyegala
- Reduce `lists::contains` dispatches for scalars (#13805) @divyegala
- Long string optimization for string column parsing in JSON reader (#13803) @karthikeyann
- Raise NotImplementedError for pd.SparseDtype (#13798) @mroeschke
- Remove the libcudf cudf::offset_type type (#13788) @davidwendt
- Move Spark-independent Table debug to cudf Java (#13783) @gerashegalov
- Update to Cython 3.0.0 (#13777) @vyasr
- Refactor Parquet reader handling of V2 page header info (#13775) @etseidl
- Branch 23.10 merge 23.08 (#13773) @vyasr
- Restructure JSON code to correctly reflect legacy/experimental status (#13757) @vuule
- Branch 23.10 merge 23.08 (#13753) @vyasr
- Enforce deprecations in `23.10` (#13732) @galipremsagar
- Upgrade to arrow 12 (#13728) @galipremsagar
- Refactors JSON reader's pushdown automaton (#13716) @elstehle
- Remove Arrow dependency from the `datasource.hpp` public header (#13698) @vuule
- C++
Published by rapids-bot[bot] over 2 years ago
https://github.com/rapidsai/cudf - [NIGHTLY] v23.06.00
🚨 Breaking Changes
- Fix batch processing for parquet writer (#13438) @ttnghia
- Use <NA> instead of null to match pandas. (#13415) @bdice
- Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
- Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
- Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
- Remove null mask and null count from column_view constructors (#13311) @vyasr
- Change default value of the `observed=` argument in groupby to `True` to reflect the actual behaviour (#13296) @shwina
- Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
- Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
- Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
- Update minimum Python version to Python 3.9 (#13196) @shwina
- Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
- Cleanup Parquet chunked writer (#13094) @ttnghia
- Cleanup ORC chunked writer (#13091) @ttnghia
- Raise `NotImplementedError` when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
- Remove deprecated regex functions from libcudf (#13067) @davidwendt
- [REVIEW] Upgrade to `arrow-11` (#12757) @galipremsagar
- Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller
🐛 Bug Fixes
- Fix valid count computation in offset_bitmask_binop kernel (#13489) @davidwendt
- Fix writing of ORC files with empty rowgroups (#13466) @vuule
- Fix cudf::repeat logic when count is zero (#13459) @davidwendt
- Fix batch processing for parquet writer (#13438) @ttnghia
- Fix invalid use of std::exclusive_scan in Parquet writer (#13434) @etseidl
- Patch numba if it is imported first to ensure minor version compatibility works. (#13433) @bdice
- Fix cudf::strings::replace_with_backrefs hang on empty match result (#13418) @davidwendt
- Use <NA> instead of null to match pandas. (#13415) @bdice
- Fix tokenize with non-space delimiter (#13403) @shwina
- Fix groupby head/tail for empty dataframe (#13398) @shwina
- Default to closed="right" in `IntervalIndex` constructor (#13394) @shwina
- Correctly reorder and reindex scan groupbys with null keys (#13389) @wence-
- Fix unused argument errors in nvcc 11.5 (#13387) @abellina
- Updates needed to work with jitify that leverages libcudacxx (#13383) @robertmaynard
- Fix unused parameter warning/error in parquet/page_data.cu (#13367) @davidwendt
- Fix page size estimation in Parquet writer (#13364) @etseidl
- Fix subword_tokenize error when input contains no tokens (#13320) @davidwendt
- Support gcc 12 as the C++ compiler (#13316) @robertmaynard
- Correctly set bitmask size in `from_column_view` (#13315) @wence-
- Fix approach to detecting assignment for gte/lte operators (#13285) @vyasr
- Fix parquet schema interpretation issue (#13277) @hyperbolic2346
- Fix 64bit shift bug in avro reader (#13276) @karthikeyann
- Fix unused variables/parameters in parquet/writer_impl.cu (#13263) @davidwendt
- Clean up buffers in case AssertionError (#13262) @razajafri
- Allow empty input table in ast `compute_column` (#13245) @wence-
- Fix structs_column_wrapper constructors to copy input column wrappers (#13243) @davidwendt
- Fix the row index stream order in ORC reader (#13242) @vuule
- Make `is_decompression_disabled` and `is_compression_disabled` thread-safe (#13240) @vuule
- Add [[maybe_unused]] to nvbench environment. (#13219) @bdice
- Fix race in ORC string dictionary creation (#13214) @revans2
- Add scalar argtypes to udf cache keys (#13194) @brandon-b-miller
- Fix unused parameter warning/error in grouped_rolling.cu (#13192) @davidwendt
- Avoid skbuild 0.17.2 which affected the cmake -DPython_LIBRARY string (#13188) @sevagh
- Fix `hostdevice_vector::subspan` (#13187) @ttnghia
- Use custom nvbench entry point to ensure `cudf::nvbench_base_fixture` usage (#13183) @robertmaynard
- Fix slice_strings to return empty strings for stop < start indices (#13178) @davidwendt
- Allow compilation with any GTest version 1.11+ (#13153) @robertmaynard
- Fix a few clang-format style check errors (#13146) @davidwendt
- [REVIEW] Fix `Series` and `DataFrame` constructors to validate index lengths (#13122) @galipremsagar
- Fix hash join when the input tables have nulls on only one side (#13120) @ttnghia
- Fix GPU_ARCHS setting in Java CMake build and CMAKE_CUDA_ARCHITECTURES in Python package build. (#13117) @davidwendt
- Adds checks to make sure json reader won't overflow (#13115) @elstehle
- Fix `null_count` of columns returned by `chunked_parquet_reader` (#13111) @vuule
- Fixes sliced list and struct column bug in JSON chunked writer (#13108) @karthikeyann
- [REVIEW] Fix missing confluent kafka version (#13101) @galipremsagar
- Use make_empty_lists_column instead of make_empty_column(type_id::LIST) (#13099) @davidwendt
- Raise `NotImplementedError` when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
- Fix column selection `read_parquet` benchmarks (#13082) @vuule
- Fix bugs in iterative groupby apply algorithm (#13078) @brandon-b-miller
- Add algorithm include in data_sink.hpp (#13068) @ahendriksen
- Fix tests/identify_stream_usage.cpp (#13066) @ahendriksen
- Prevent overflow with `skip_rows` in ORC and Parquet readers (#13063) @vuule
- Add except declaration in Cython interface for regex_program::create (#13054) @davidwendt
- [REVIEW] Fix branch version in CI scripts (#13029) @galipremsagar
- Fix OOB memory access in CSV reader when reading without NA values (#13011) @vuule
- Fix read_avro() skip_rows and num_rows. (#12912) @tpn
- Purge nonempty nulls from byte_cast list outputs. (#11971) @bdice
- Fix consumption of CPU-backed interchange protocol dataframes (#11392) @shwina
🚀 New Features
- Remove numba JIT kernel usage from dataframe copy tests (#13385) @brandon-b-miller
- Add JNI for ORC/Parquet writer compression statistics (#13376) @ttnghia
- Use compile_or_get in JIT groupby apply (#13350) @brandon-b-miller
- cuDF numba cuda 12 updates (#13337) @brandon-b-miller
- Add tz_convert method to convert between timestamps (#13328) @shwina
- Optionally return compression statistics from ORC and Parquet writers (#13294) @vuule
- Support the case=False argument to str.contains (#13290) @shwina
- Add an event handler for ColumnVector.close (#13279) @abellina
- JNI api for cudf::chunked_pack (#13278) @abellina
- Implement a chunked_pack API (#13260) @abellina
- Update cudf recipes to use GTest version to >=1.13 (#13207) @robertmaynard
- JNI changes for range-extents in window functions. (#13199) @mythrocks
- Add support for DatetimeTZDtype and tz_localize (#13163) @shwina
- Add IS_NULL operator to AST (#13145) @karthikeyann
- STRING order-by column for RANGE window functions (#13143) @mythrocks
- Update `contains_table` to experimental row hasher and equality comparator (#13119) @divyegala
- Automatically select `GroupBy.apply` algorithm based on if the UDF is jittable (#13113) @brandon-b-miller
- Refactor Parquet chunked writer (#13076) @ttnghia
- Add Python bindings for string literal support in AST (#13073) @karthikeyann
- Add Java bindings for string literal support in AST (#13072) @karthikeyann
- Add string scalar support in AST (#13061) @karthikeyann
- Log cuIO warnings using the libcudf logger (#13043) @vuule
- Update `mixed_join` to use experimental row hasher and comparator (#13028) @divyegala
- Support structs of lists in row lexicographic comparator (#13005) @ttnghia
- Adding `hostdevice_span` that is a span createable from `hostdevice_vector` (#12981) @hyperbolic2346
- Add nvtext::minhash function (#12961) @davidwendt
- Support lists of structs in row lexicographic comparator (#12953) @ttnghia
- Update `join` to use experimental row hasher and comparator (#12787) @divyegala
- Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller
🛠️ Improvements
- Bump typing_extensions minimum version to 4.0.0 (#13618) @shwina
- Drop extraneous dependencies from cudf conda recipe. (#13406) @bdice
- Handle some corner-cases in indexing with boolean masks (#13402) @wence-
- Add cudf::stable_distinct public API, tests, and benchmarks. (#13392) @bdice
- [JNI] Pass this ColumnVector to the onClosed event handler (#13386) @abellina
- Fix JNI method with mismatched parameter list (#13384) @ttnghia
- Split up experimental_row_operator_tests.cu to improve its compile time (#13382) @davidwendt
- Deprecate cudf::strings::slice_strings APIs that accept delimiters (#13373) @davidwendt
- Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
- Move some nvtext benchmarks to nvbench (#13368) @davidwendt
- run docs nightly too (#13366) @AyodeAwe
- Add warning for default `dtype` parameter in `get_dummies` (#13365) @galipremsagar
- Add log messages about kvikIO compatibility mode (#13363) @vuule
- Switch back to using primary shared-action-workflows branch (#13362) @vyasr
- Deprecate `StringIndex` and use `Index` instead (#13361) @galipremsagar
- Ensure columns have valid null counts in CUDF JNI. (#13355) @mythrocks
- Expunge most uses of `TypeVar(bound="Foo")` (#13346) @wence-
- Remove all references to UNKNOWN_NULL_COUNT in Python (#13345) @vyasr
- Improve `distinct_count` with `cuco::static_set` (#13343) @PointKernel
- Fix `contiguous_split` performance (#13342) @ttnghia
- Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
- Update mypy to 1.3 (#13340) @wence-
- [Java] Purge non-empty nulls when setting validity (#13335) @razajafri
- Add row-wise filtering step to `read_parquet` (#13334) @rjzamora
- Performance improvement for nvtext::minhash (#13333) @davidwendt
- Fix some libcudf functions to set the null count on returning columns (#13331) @davidwendt
- Change cudf::detail::concatenate_masks to return null-count (#13330) @davidwendt
- Move `meta` calculation in `dask_cudf.read_parquet` (#13327) @rjzamora
- Changes to support Numpy >= 1.24 (#13325) @shwina
- Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
- Clean up `distinct_count` benchmark (#13321) @PointKernel
- Fix gtest pinning to 1.13.0. (#13319) @bdice
- Remove null mask and null count from column_view constructors (#13311) @vyasr
- Address feedback from 13289 (#13306) @vyasr
- Change default value of the `observed=` argument in groupby to `True` to reflect the actual behaviour (#13296) @shwina
- First check for `BaseDtype` when inferring the data type of an arbitrary object (#13295) @shwina
- Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
- Support CUDA 12.0 for pip wheels (#13289) @divyegala
- Refactor `transform_lists_of_structs` in `row_operators.cu` (#13288) @ttnghia
- Branch 23.06 merge 23.04 (#13286) @vyasr
- Update cupy dependency (#13284) @vyasr
- Performance improvement in cudf::strings::join_strings for long strings (#13283) @davidwendt
- Fix unused variables and functions (#13275) @karthikeyann
- Fix integer overflow in `partition` `scatter_map` construction (#13272) @wence-
- Numba 0.57 compatibility fixes (#13271) @gmarkall
- Performance improvement in cudf::strings::all_characters_of_type (#13259) @davidwendt
- Remove default null-count parameter from some libcudf factory functions (#13258) @davidwendt
- Roll our own generate_string() because mimesis' has gone away (#13257) @shwina
- Build wheels using new single image workflow (#13249) @vyasr
- Enable sccache hits from local builds (#13248) @AyodeAwe
- Revert to branch-23.06 for shared-action-workflows (#13247) @shwina
- Introduce `pandas_compatible` option in `cudf` (#13241) @galipremsagar
- Add metadata_builder helper class (#13232) @abellina
- Use libkvikio conda packages in libcudf, add explicit libcufile dependency. (#13231) @bdice
- Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
- Performance improvement in cudf::strings::find/rfind for long strings (#13226) @davidwendt
- Add chunked reader benchmark (#13223) @SrikarVanavasam
- Set the null count in output columns in the CSV reader (#13221) @vuule
- Skip Non-Empty nulls tests for the nightly build just like we skip CuFileTest and CudaFatalTest (#13213) @razajafri
- Fix string_scalar stream usage in write_json.cu (#13212) @davidwendt
- Use canonicalized name for dlopen'd libraries (libcufile) (#13210) @shwina
- Refactor pinned memory vector and ORC+Parquet writers (#13206) @ttnghia
- Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
- Optimization to decoding of parquet level streams (#13203) @nvdbaranec
- Clean up and simplify `gpuDecideCompression` (#13202) @vuule
- Use std::array for a statically sized vector in `create_serialized_trie` (#13201) @vuule
- Update minimum Python version to Python 3.9 (#13196) @shwina
- Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
- Remove usage of rapids-get-rapids-version-from-git (#13184) @jjacobelli
- Enable mixed-dtype decimal/scalar binary operations (#13171) @shwina
- Split up unique_count.cu to improve build time (#13169) @davidwendt
- Use nvtx3 includes in string examples. (#13165) @bdice
- Change some .cu gtest files to .cpp (#13155) @davidwendt
- Remove wheel pytest verbosity (#13151) @sevagh
- Fix libcudf to always pass null-count to set_null_mask (#13149) @davidwendt
- Fix gtests to always pass null-count to set_null_mask calls (#13148) @davidwendt
- Optimize JSON writer (#13144) @karthikeyann
- Performance improvement for libcudf upper/lower conversion for long strings (#13142) @davidwendt
- [REVIEW] Deprecate `pad` and `backfill` methods (#13140) @galipremsagar
- Use CTAD instead of functions in ProtobufReader (#13135) @vuule
- Remove more instances of `UNKNOWN_NULL_COUNT` (#13134) @vyasr
- Update clang-format to 16.0.1. (#13133) @bdice
- Add log messages about cuIO's nvCOMP and cuFile use (#13132) @vuule
- Branch 23.06 merge 23.04 (#13131) @vyasr
- Compute null-count in cudf::detail::slice (#13124) @davidwendt
- Use ARC V2 self-hosted runners for GPU jobs (#13123) @jjacobelli
- Set null-count in linked_column_view conversion operator (#13121) @davidwendt
- Adding ifdefs around nvcc-specific pragmas (#13110) @hyperbolic2346
- Add null-count parameter to json experimental parse_data utility (#13107) @davidwendt
- Remove uses-setup-env-vars (#13105) @vyasr
- Explicitly compute null count in concatenate APIs (#13104) @vyasr
- Replace unnecessary uses of `UNKNOWN_NULL_COUNT` (#13102) @vyasr
- Performance improvement for cudf::string_view::find functions (#13100) @davidwendt
- Use `.element()` instead of `.data()` for window range calculations (#13095) @mythrocks
- Cleanup Parquet chunked writer (#13094) @ttnghia
- Fix unused variable error/warning in page_data.cu (#13093) @davidwendt
- Cleanup ORC chunked writer (#13091) @ttnghia
- Remove using namespace cudf; from libcudf gtests source (#13089) @davidwendt
- Change cudf::test::make_null_mask to also return null-count (#13081) @davidwendt
- Resolved automerger from `branch-23.04` to `branch-23.06` (#13080) @galipremsagar
- Assert for non-empty nulls (#13071) @razajafri
- Remove deprecated regex functions from libcudf (#13067) @davidwendt
- Refactor `cudf::detail::sorted_order` (#13062) @ttnghia
- Improve performance of slice_strings for long strings (#13057) @davidwendt
- Reduce shared memory usage in gpuComputePageSizes by 50% (#13047) @nvdbaranec
- [REVIEW] Add notes to performance comparisons notebook (#13044) @galipremsagar
- Enable binary operations between scalars and columns of differing decimal types (#13034) @shwina
- Remove console output from some libcudf gtests (#13027) @davidwendt
- Remove underscore in build string. (#13025) @bdice
- Bump up JNI version 23.06.0-SNAPSHOT (#13021) @pxLi
- Fix auto merger from `branch-23.04` to `branch-23.06` (#13009) @galipremsagar
- Reduce peak memory use when writing compressed ORC files. (#12963) @vuule
- Add nvtx annotations to groupby methods (#12941) @wence-
- Compute column sizes in Parquet preprocess with single kernel (#12931) @SrikarVanavasam
- Add Python bindings for time zone data (TZiF) reader (#12826) @shwina
- Optimize set-like operations (#12769) @ttnghia
- [REVIEW] Upgrade to `arrow-11` (#12757) @galipremsagar
- Add empty test files for test reorganization (#12288) @shwina
- C++
Published by rapids-bot[bot] over 2 years ago
https://github.com/rapidsai/cudf - v23.06.01
🚨 Breaking Changes
- Fix batch processing for parquet writer (#13438) @ttnghia
- Use <NA> instead of null to match pandas. (#13415) @bdice
- Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
- Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
- Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
- Remove null mask and null count from column_view constructors (#13311) @vyasr
- Change default value of the `observed=` argument in groupby to `True` to reflect the actual behaviour (#13296) @shwina
- Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
- Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
- Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
- Update minimum Python version to Python 3.9 (#13196) @shwina
- Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
- Cleanup Parquet chunked writer (#13094) @ttnghia
- Cleanup ORC chunked writer (#13091) @ttnghia
- Raise `NotImplementedError` when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
- Remove deprecated regex functions from libcudf (#13067) @davidwendt
- [REVIEW] Upgrade to `arrow-11` (#12757) @galipremsagar
- Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller
🐛 Bug Fixes
- Fix valid count computation in offset_bitmask_binop kernel (#13489) @davidwendt
- Fix writing of ORC files with empty rowgroups (#13466) @vuule
- Fix cudf::repeat logic when count is zero (#13459) @davidwendt
- Fix batch processing for parquet writer (#13438) @ttnghia
- Fix invalid use of std::exclusive_scan in Parquet writer (#13434) @etseidl
- Patch numba if it is imported first to ensure minor version compatibility works. (#13433) @bdice
- Fix cudf::strings::replace_with_backrefs hang on empty match result (#13418) @davidwendt
- Use <NA> instead of null to match pandas. (#13415) @bdice
- Fix tokenize with non-space delimiter (#13403) @shwina
- Fix groupby head/tail for empty dataframe (#13398) @shwina
- Default to closed="right" in `IntervalIndex` constructor (#13394) @shwina
- Correctly reorder and reindex scan groupbys with null keys (#13389) @wence-
- Fix unused argument errors in nvcc 11.5 (#13387) @abellina
- Updates needed to work with jitify that leverages libcudacxx (#13383) @robertmaynard
- Fix unused parameter warning/error in parquet/page_data.cu (#13367) @davidwendt
- Fix page size estimation in Parquet writer (#13364) @etseidl
- Fix subword_tokenize error when input contains no tokens (#13320) @davidwendt
- Support gcc 12 as the C++ compiler (#13316) @robertmaynard
- Correctly set bitmask size in `from_column_view` (#13315) @wence-
- Fix approach to detecting assignment for gte/lte operators (#13285) @vyasr
- Fix parquet schema interpretation issue (#13277) @hyperbolic2346
- Fix 64bit shift bug in avro reader (#13276) @karthikeyann
- Fix unused variables/parameters in parquet/writer_impl.cu (#13263) @davidwendt
- Clean up buffers in case AssertionError (#13262) @razajafri
- Allow empty input table in ast `compute_column` (#13245) @wence-
- Fix structs_column_wrapper constructors to copy input column wrappers (#13243) @davidwendt
- Fix the row index stream order in ORC reader (#13242) @vuule
- Make `is_decompression_disabled` and `is_compression_disabled` thread-safe (#13240) @vuule
- Add [[maybe_unused]] to nvbench environment. (#13219) @bdice
- Fix race in ORC string dictionary creation (#13214) @revans2
- Add scalar argtypes to udf cache keys (#13194) @brandon-b-miller
- Fix unused parameter warning/error in grouped_rolling.cu (#13192) @davidwendt
- Avoid skbuild 0.17.2 which affected the cmake -DPython_LIBRARY string (#13188) @sevagh
- Fix `hostdevice_vector::subspan` (#13187) @ttnghia
- Use custom nvbench entry point to ensure `cudf::nvbench_base_fixture` usage (#13183) @robertmaynard
- Fix slice_strings to return empty strings for stop < start indices (#13178) @davidwendt
- Allow compilation with any GTest version 1.11+ (#13153) @robertmaynard
- Fix a few clang-format style check errors (#13146) @davidwendt
- [REVIEW] Fix `Series` and `DataFrame` constructors to validate index lengths (#13122) @galipremsagar
- Fix hash join when the input tables have nulls on only one side (#13120) @ttnghia
- Fix GPU_ARCHS setting in Java CMake build and CMAKE_CUDA_ARCHITECTURES in Python package build. (#13117) @davidwendt
- Adds checks to make sure json reader won't overflow (#13115) @elstehle
- Fix `null_count` of columns returned by `chunked_parquet_reader` (#13111) @vuule
- Fixes sliced list and struct column bug in JSON chunked writer (#13108) @karthikeyann
- [REVIEW] Fix missing confluent kafka version (#13101) @galipremsagar
- Use make_empty_lists_column instead of make_empty_column(type_id::LIST) (#13099) @davidwendt
- Raise `NotImplementedError` when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
- Fix column selection `read_parquet` benchmarks (#13082) @vuule
- Fix bugs in iterative groupby apply algorithm (#13078) @brandon-b-miller
- Add algorithm include in data_sink.hpp (#13068) @ahendriksen
- Fix tests/identify_stream_usage.cpp (#13066) @ahendriksen
- Prevent overflow with `skip_rows` in ORC and Parquet readers (#13063) @vuule
- Add except declaration in Cython interface for regex_program::create (#13054) @davidwendt
- [REVIEW] Fix branch version in CI scripts (#13029) @galipremsagar
- Fix OOB memory access in CSV reader when reading without NA values (#13011) @vuule
- Fix read_avro() skip_rows and num_rows. (#12912) @tpn
- Purge nonempty nulls from byte_cast list outputs. (#11971) @bdice
- Fix consumption of CPU-backed interchange protocol dataframes (#11392) @shwina
🚀 New Features
- Remove numba JIT kernel usage from dataframe copy tests (#13385) @brandon-b-miller
- Add JNI for ORC/Parquet writer compression statistics (#13376) @ttnghia
- Use compile_or_get in JIT groupby apply (#13350) @brandon-b-miller
- cuDF numba cuda 12 updates (#13337) @brandon-b-miller
- Add tz_convert method to convert between timestamps (#13328) @shwina
- Optionally return compression statistics from ORC and Parquet writers (#13294) @vuule
- Support the case=False argument to str.contains (#13290) @shwina
- Add an event handler for ColumnVector.close (#13279) @abellina
- JNI api for cudf::chunked_pack (#13278) @abellina
- Implement a chunked_pack API (#13260) @abellina
- Update cudf recipes to use GTest version to >=1.13 (#13207) @robertmaynard
- JNI changes for range-extents in window functions. (#13199) @mythrocks
- Add support for DatetimeTZDtype and tz_localize (#13163) @shwina
- Add IS_NULL operator to AST (#13145) @karthikeyann
- STRING order-by column for RANGE window functions (#13143) @mythrocks
- Update `contains_table` to experimental row hasher and equality comparator (#13119) @divyegala
- Automatically select `GroupBy.apply` algorithm based on if the UDF is jittable (#13113) @brandon-b-miller
- Refactor Parquet chunked writer (#13076) @ttnghia
- Add Python bindings for string literal support in AST (#13073) @karthikeyann
- Add Java bindings for string literal support in AST (#13072) @karthikeyann
- Add string scalar support in AST (#13061) @karthikeyann
- Log cuIO warnings using the libcudf logger (#13043) @vuule
- Update `mixed_join` to use experimental row hasher and comparator (#13028) @divyegala
- Support structs of lists in row lexicographic comparator (#13005) @ttnghia
- Adding `hostdevice_span` that is a span createable from `hostdevice_vector` (#12981) @hyperbolic2346
- Add nvtext::minhash function (#12961) @davidwendt
- Support lists of structs in row lexicographic comparator (#12953) @ttnghia
- Update `join` to use experimental row hasher and comparator (#12787) @divyegala
- Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller
🛠️ Improvements
- Bump typing_extensions minimum version to 4.0.0 (#13618) @shwina
- Drop extraneous dependencies from cudf conda recipe. (#13406) @bdice
- Handle some corner-cases in indexing with boolean masks (#13402) @wence-
- Add cudf::stable_distinct public API, tests, and benchmarks. (#13392) @bdice
- [JNI] Pass this ColumnVector to the onClosed event handler (#13386) @abellina
- Fix JNI method with mismatched parameter list (#13384) @ttnghia
- Split up experimental_row_operator_tests.cu to improve its compile time (#13382) @davidwendt
- Deprecate cudf::strings::slice_strings APIs that accept delimiters (#13373) @davidwendt
- Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
- Move some nvtext benchmarks to nvbench (#13368) @davidwendt
- run docs nightly too (#13366) @AyodeAwe
- Add warning for default `dtype` parameter in `get_dummies` (#13365) @galipremsagar
- Add log messages about kvikIO compatibility mode (#13363) @vuule
- Switch back to using primary shared-action-workflows branch (#13362) @vyasr
- Deprecate `StringIndex` and use `Index` instead (#13361) @galipremsagar
- Ensure columns have valid null counts in CUDF JNI. (#13355) @mythrocks
- Expunge most uses of `TypeVar(bound="Foo")` (#13346) @wence-
- Remove all references to UNKNOWN_NULL_COUNT in Python (#13345) @vyasr
- Improve `distinct_count` with `cuco::static_set` (#13343) @PointKernel
- Fix `contiguous_split` performance (#13342) @ttnghia
- Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
- Update mypy to 1.3 (#13340) @wence-
- [Java] Purge non-empty nulls when setting validity (#13335) @razajafri
- Add row-wise filtering step to `read_parquet` (#13334) @rjzamora
- Performance improvement for nvtext::minhash (#13333) @davidwendt
- Fix some libcudf functions to set the null count on returning columns (#13331) @davidwendt
- Change cudf::detail::concatenate_masks to return null-count (#13330) @davidwendt
- Move `meta` calculation in `dask_cudf.read_parquet` (#13327) @rjzamora
- Changes to support Numpy >= 1.24 (#13325) @shwina
- Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
- Clean up `distinct_count` benchmark (#13321) @PointKernel
- Fix gtest pinning to 1.13.0. (#13319) @bdice
- Remove null mask and null count from column_view constructors (#13311) @vyasr
- Address feedback from 13289 (#13306) @vyasr
- Change default value of the `observed=` argument in groupby to `True` to reflect the actual behaviour (#13296) @shwina
- First check for `BaseDtype` when inferring the data type of an arbitrary object (#13295) @shwina
- Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
- Support CUDA 12.0 for pip wheels (#13289) @divyegala
- Refactor `transform_lists_of_structs` in `row_operators.cu` (#13288) @ttnghia
- Branch 23.06 merge 23.04 (#13286) @vyasr
- Update cupy dependency (#13284) @vyasr
- Performance improvement in cudf::strings::join_strings for long strings (#13283) @davidwendt
- Fix unused variables and functions (#13275) @karthikeyann
- Fix integer overflow in `partition` `scatter_map` construction (#13272) @wence-
- Numba 0.57 compatibility fixes (#13271) @gmarkall
- Performance improvement in cudf::strings::all_characters_of_type (#13259) @davidwendt
- Remove default null-count parameter from some libcudf factory functions (#13258) @davidwendt
- Roll our own generate_string() because mimesis' has gone away (#13257) @shwina
- Build wheels using new single image workflow (#13249) @vyasr
- Enable sccache hits from local builds (#13248) @AyodeAwe
- Revert to branch-23.06 for shared-action-workflows (#13247) @shwina
- Introduce `pandas_compatible` option in `cudf` (#13241) @galipremsagar
- Add metadata_builder helper class (#13232) @abellina
- Use libkvikio conda packages in libcudf, add explicit libcufile dependency. (#13231) @bdice
- Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
- Performance improvement in cudf::strings::find/rfind for long strings (#13226) @davidwendt
- Add chunked reader benchmark (#13223) @SrikarVanavasam
- Set the null count in output columns in the CSV reader (#13221) @vuule
- Skip Non-Empty nulls tests for the nightly build just like we skip CuFileTest and CudaFatalTest (#13213) @razajafri
- Fix string_scalar stream usage in write_json.cu (#13212) @davidwendt
- Use canonicalized name for dlopen'd libraries (libcufile) (#13210) @shwina
- Refactor pinned memory vector and ORC+Parquet writers (#13206) @ttnghia
- Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
- Optimization to decoding of parquet level streams (#13203) @nvdbaranec
- Clean up and simplify `gpuDecideCompression` (#13202) @vuule
- Use std::array for a statically sized vector in `create_serialized_trie` (#13201) @vuule
- Update minimum Python version to Python 3.9 (#13196) @shwina
- Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
- Remove usage of rapids-get-rapids-version-from-git (#13184) @jjacobelli
- Enable mixed-dtype decimal/scalar binary operations (#13171) @shwina
- Split up unique_count.cu to improve build time (#13169) @davidwendt
- Use nvtx3 includes in string examples. (#13165) @bdice
- Change some .cu gtest files to .cpp (#13155) @davidwendt
- Remove wheel pytest verbosity (#13151) @sevagh
- Fix libcudf to always pass null-count to set_null_mask (#13149) @davidwendt
- Fix gtests to always pass null-count to set_null_mask calls (#13148) @davidwendt
- Optimize JSON writer (#13144) @karthikeyann
- Performance improvement for libcudf upper/lower conversion for long strings (#13142) @davidwendt
- [REVIEW] Deprecate `pad` and `backfill` methods (#13140) @galipremsagar
- Use CTAD instead of functions in ProtobufReader (#13135) @vuule
- Remove more instances of `UNKNOWN_NULL_COUNT` (#13134) @vyasr
- Update clang-format to 16.0.1. (#13133) @bdice
- Add log messages about cuIO's nvCOMP and cuFile use (#13132) @vuule
- Branch 23.06 merge 23.04 (#13131) @vyasr
- Compute null-count in cudf::detail::slice (#13124) @davidwendt
- Use ARC V2 self-hosted runners for GPU jobs (#13123) @jjacobelli
- Set null-count in linked_column_view conversion operator (#13121) @davidwendt
- Adding ifdefs around nvcc-specific pragmas (#13110) @hyperbolic2346
- Add null-count parameter to json experimental parse_data utility (#13107) @davidwendt
- Remove uses-setup-env-vars (#13105) @vyasr
- Explicitly compute null count in concatenate APIs (#13104) @vyasr
- Replace unnecessary uses of UNKNOWN_NULL_COUNT (#13102) @vyasr
- Performance improvement for cudf::string_view::find functions (#13100) @davidwendt
- Use .element() instead of .data() for window range calculations (#13095) @mythrocks
- Cleanup Parquet chunked writer (#13094) @ttnghia
- Fix unused variable error/warning in page_data.cu (#13093) @davidwendt
- Cleanup ORC chunked writer (#13091) @ttnghia
- Remove using namespace cudf; from libcudf gtests source (#13089) @davidwendt
- Change cudf::test::make_null_mask to also return null-count (#13081) @davidwendt
- Resolved automerger from branch-23.04 to branch-23.06 (#13080) @galipremsagar
- Assert for non-empty nulls (#13071) @razajafri
- Remove deprecated regex functions from libcudf (#13067) @davidwendt
- Refactor cudf::detail::sorted_order (#13062) @ttnghia
- Improve performance of slice_strings for long strings (#13057) @davidwendt
- Reduce shared memory usage in gpuComputePageSizes by 50% (#13047) @nvdbaranec
- [REVIEW] Add notes to performance comparisons notebook (#13044) @galipremsagar
- Enable binary operations between scalars and columns of differing decimal types (#13034) @shwina
- Remove console output from some libcudf gtests (#13027) @davidwendt
- Remove underscore in build string. (#13025) @bdice
- Bump up JNI version 23.06.0-SNAPSHOT (#13021) @pxLi
- Fix auto merger from branch-23.04 to branch-23.06 (#13009) @galipremsagar
- Reduce peak memory use when writing compressed ORC files. (#12963) @vuule
- Add nvtx annotations to groupby methods (#12941) @wence-
- Compute column sizes in Parquet preprocess with single kernel (#12931) @SrikarVanavasam
- Add Python bindings for time zone data (TZiF) reader (#12826) @shwina
- Optimize set-like operations (#12769) @ttnghia
- [REVIEW] Upgrade to arrow-11 (#12757) @galipremsagar
- Add empty test files for test reorganization (#12288) @shwina
Published by raydouglass over 2 years ago
https://github.com/rapidsai/cudf - v23.06.00
🚨 Breaking Changes
- Fix batch processing for parquet writer (#13438) @ttnghia
- Use <NA> instead of null to match pandas. (#13415) @bdice
- Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
- Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
- Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
- Remove null mask and null count from column_view constructors (#13311) @vyasr
- Change default value of the observed= argument in groupby to True to reflect the actual behaviour (#13296) @shwina
- Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
- Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
- Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
- Update minimum Python version to Python 3.9 (#13196) @shwina
- Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
- Cleanup Parquet chunked writer (#13094) @ttnghia
- Cleanup ORC chunked writer (#13091) @ttnghia
- Raise NotImplementedError when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
- Remove deprecated regex functions from libcudf (#13067) @davidwendt
- [REVIEW] Upgrade to arrow-11 (#12757) @galipremsagar
- Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller
🐛 Bug Fixes
- Fix valid count computation in offset_bitmask_binop kernel (#13489) @davidwendt
- Fix writing of ORC files with empty rowgroups (#13466) @vuule
- Fix cudf::repeat logic when count is zero (#13459) @davidwendt
- Fix batch processing for parquet writer (#13438) @ttnghia
- Fix invalid use of std::exclusive_scan in Parquet writer (#13434) @etseidl
- Patch numba if it is imported first to ensure minor version compatibility works. (#13433) @bdice
- Fix cudf::strings::replace_with_backrefs hang on empty match result (#13418) @davidwendt
- Use <NA> instead of null to match pandas. (#13415) @bdice
- Fix tokenize with non-space delimiter (#13403) @shwina
- Fix groupby head/tail for empty dataframe (#13398) @shwina
- Default to closed="right" in IntervalIndex constructor (#13394) @shwina
- Correctly reorder and reindex scan groupbys with null keys (#13389) @wence-
- Fix unused argument errors in nvcc 11.5 (#13387) @abellina
- Updates needed to work with jitify that leverages libcudacxx (#13383) @robertmaynard
- Fix unused parameter warning/error in parquet/page_data.cu (#13367) @davidwendt
- Fix page size estimation in Parquet writer (#13364) @etseidl
- Fix subword_tokenize error when input contains no tokens (#13320) @davidwendt
- Support gcc 12 as the C++ compiler (#13316) @robertmaynard
- Correctly set bitmask size in from_column_view (#13315) @wence-
- Fix approach to detecting assignment for gte/lte operators (#13285) @vyasr
- Fix parquet schema interpretation issue (#13277) @hyperbolic2346
- Fix 64bit shift bug in avro reader (#13276) @karthikeyann
- Fix unused variables/parameters in parquet/writer_impl.cu (#13263) @davidwendt
- Clean up buffers in case AssertionError (#13262) @razajafri
- Allow empty input table in ast compute_column (#13245) @wence-
- Fix structs_column_wrapper constructors to copy input column wrappers (#13243) @davidwendt
- Fix the row index stream order in ORC reader (#13242) @vuule
- Make is_decompression_disabled and is_compression_disabled thread-safe (#13240) @vuule
- Add [[maybe_unused]] to nvbench environment. (#13219) @bdice
- Fix race in ORC string dictionary creation (#13214) @revans2
- Add scalar argtypes to udf cache keys (#13194) @brandon-b-miller
- Fix unused parameter warning/error in grouped_rolling.cu (#13192) @davidwendt
- Avoid skbuild 0.17.2 which affected the cmake -DPython_LIBRARY string (#13188) @sevagh
- Fix hostdevice_vector::subspan (#13187) @ttnghia
- Use custom nvbench entry point to ensure cudf::nvbench_base_fixture usage (#13183) @robertmaynard
- Fix slice_strings to return empty strings for stop < start indices (#13178) @davidwendt
- Allow compilation with any GTest version 1.11+ (#13153) @robertmaynard
- Fix a few clang-format style check errors (#13146) @davidwendt
- [REVIEW] Fix Series and DataFrame constructors to validate index lengths (#13122) @galipremsagar
- Fix hash join when the input tables have nulls on only one side (#13120) @ttnghia
- Fix GPU_ARCHS setting in Java CMake build and CMAKE_CUDA_ARCHITECTURES in Python package build. (#13117) @davidwendt
- Adds checks to make sure json reader won't overflow (#13115) @elstehle
- Fix null_count of columns returned by chunked_parquet_reader (#13111) @vuule
- Fixes sliced list and struct column bug in JSON chunked writer (#13108) @karthikeyann
- [REVIEW] Fix missing confluent kafka version (#13101) @galipremsagar
- Use make_empty_lists_column instead of make_empty_column(type_id::LIST) (#13099) @davidwendt
- Raise NotImplementedError when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
- Fix column selection read_parquet benchmarks (#13082) @vuule
- Fix bugs in iterative groupby apply algorithm (#13078) @brandon-b-miller
- Add algorithm include in data_sink.hpp (#13068) @ahendriksen
- Fix tests/identify_stream_usage.cpp (#13066) @ahendriksen
- Prevent overflow with skip_rows in ORC and Parquet readers (#13063) @vuule
- Add except declaration in Cython interface for regex_program::create (#13054) @davidwendt
- [REVIEW] Fix branch version in CI scripts (#13029) @galipremsagar
- Fix OOB memory access in CSV reader when reading without NA values (#13011) @vuule
- Fix read_avro() skip_rows and num_rows. (#12912) @tpn
- Purge nonempty nulls from byte_cast list outputs. (#11971) @bdice
- Fix consumption of CPU-backed interchange protocol dataframes (#11392) @shwina
🚀 New Features
- Remove numba JIT kernel usage from dataframe copy tests (#13385) @brandon-b-miller
- Add JNI for ORC/Parquet writer compression statistics (#13376) @ttnghia
- Use compile_or_get in JIT groupby apply (#13350) @brandon-b-miller
- cuDF numba cuda 12 updates (#13337) @brandon-b-miller
- Add tz_convert method to convert between timestamps (#13328) @shwina
- Optionally return compression statistics from ORC and Parquet writers (#13294) @vuule
- Support the case=False argument to str.contains (#13290) @shwina
- Add an event handler for ColumnVector.close (#13279) @abellina
- JNI api for cudf::chunked_pack (#13278) @abellina
- Implement a chunked_pack API (#13260) @abellina
- Update cudf recipes to use GTest version to >=1.13 (#13207) @robertmaynard
- JNI changes for range-extents in window functions. (#13199) @mythrocks
- Add support for DatetimeTZDtype and tz_localize (#13163) @shwina
- Add IS_NULL operator to AST (#13145) @karthikeyann
- STRING order-by column for RANGE window functions (#13143) @mythrocks
- Update contains_table to experimental row hasher and equality comparator (#13119) @divyegala
- Automatically select GroupBy.apply algorithm based on if the UDF is jittable (#13113) @brandon-b-miller
- Refactor Parquet chunked writer (#13076) @ttnghia
- Add Python bindings for string literal support in AST (#13073) @karthikeyann
- Add Java bindings for string literal support in AST (#13072) @karthikeyann
- Add string scalar support in AST (#13061) @karthikeyann
- Log cuIO warnings using the libcudf logger (#13043) @vuule
- Update mixed_join to use experimental row hasher and comparator (#13028) @divyegala
- Support structs of lists in row lexicographic comparator (#13005) @ttnghia
- Adding hostdevice_span that is a span creatable from hostdevice_vector (#12981) @hyperbolic2346
- Add nvtext::minhash function (#12961) @davidwendt
- Support lists of structs in row lexicographic comparator (#12953) @ttnghia
- Update join to use experimental row hasher and comparator (#12787) @divyegala
- Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller
🛠️ Improvements
- Drop extraneous dependencies from cudf conda recipe. (#13406) @bdice
- Handle some corner-cases in indexing with boolean masks (#13402) @wence-
- Add cudf::stable_distinct public API, tests, and benchmarks. (#13392) @bdice
- [JNI] Pass this ColumnVector to the onClosed event handler (#13386) @abellina
- Fix JNI method with mismatched parameter list (#13384) @ttnghia
- Split up experimental_row_operator_tests.cu to improve its compile time (#13382) @davidwendt
- Deprecate cudf::strings::slice_strings APIs that accept delimiters (#13373) @davidwendt
- Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
- Move some nvtext benchmarks to nvbench (#13368) @davidwendt
- run docs nightly too (#13366) @AyodeAwe
- Add warning for default dtype parameter in get_dummies (#13365) @galipremsagar
- Add log messages about kvikIO compatibility mode (#13363) @vuule
- Switch back to using primary shared-action-workflows branch (#13362) @vyasr
- Deprecate StringIndex and use Index instead (#13361) @galipremsagar
- Ensure columns have valid null counts in CUDF JNI. (#13355) @mythrocks
- Expunge most uses of TypeVar(bound="Foo") (#13346) @wence-
- Remove all references to UNKNOWN_NULL_COUNT in Python (#13345) @vyasr
- Improve distinct_count with cuco::static_set (#13343) @PointKernel
- Fix contiguous_split performance (#13342) @ttnghia
- Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
- Update mypy to 1.3 (#13340) @wence-
- [Java] Purge non-empty nulls when setting validity (#13335) @razajafri
- Add row-wise filtering step to read_parquet (#13334) @rjzamora
- Performance improvement for nvtext::minhash (#13333) @davidwendt
- Fix some libcudf functions to set the null count on returning columns (#13331) @davidwendt
- Change cudf::detail::concatenate_masks to return null-count (#13330) @davidwendt
- Move meta calculation in dask_cudf.read_parquet (#13327) @rjzamora
- Changes to support Numpy >= 1.24 (#13325) @shwina
- Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
- Clean up distinct_count benchmark (#13321) @PointKernel
- Fix gtest pinning to 1.13.0. (#13319) @bdice
- Remove null mask and null count from column_view constructors (#13311) @vyasr
- Address feedback from 13289 (#13306) @vyasr
- Change default value of the observed= argument in groupby to True to reflect the actual behaviour (#13296) @shwina
- First check for BaseDtype when inferring the data type of an arbitrary object (#13295) @shwina
- Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
- Support CUDA 12.0 for pip wheels (#13289) @divyegala
- Refactor transform_lists_of_structs in row_operators.cu (#13288) @ttnghia
- Branch 23.06 merge 23.04 (#13286) @vyasr
- Update cupy dependency (#13284) @vyasr
- Performance improvement in cudf::strings::join_strings for long strings (#13283) @davidwendt
- Fix unused variables and functions (#13275) @karthikeyann
- Fix integer overflow in partition scatter_map construction (#13272) @wence-
- Numba 0.57 compatibility fixes (#13271) @gmarkall
- Performance improvement in cudf::strings::all_characters_of_type (#13259) @davidwendt
- Remove default null-count parameter from some libcudf factory functions (#13258) @davidwendt
- Roll our own generate_string() because mimesis' has gone away (#13257) @shwina
- Build wheels using new single image workflow (#13249) @vyasr
- Enable sccache hits from local builds (#13248) @AyodeAwe
- Revert to branch-23.06 for shared-action-workflows (#13247) @shwina
- Introduce pandas_compatible option in cudf (#13241) @galipremsagar
- Add metadata_builder helper class (#13232) @abellina
- Use libkvikio conda packages in libcudf, add explicit libcufile dependency. (#13231) @bdice
- Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
- Performance improvement in cudf::strings::find/rfind for long strings (#13226) @davidwendt
- Add chunked reader benchmark (#13223) @SrikarVanavasam
- Set the null count in output columns in the CSV reader (#13221) @vuule
- Skip Non-Empty nulls tests for the nightly build just like we skip CuFileTest and CudaFatalTest (#13213) @razajafri
- Fix string_scalar stream usage in write_json.cu (#13212) @davidwendt
- Use canonicalized name for dlopen'd libraries (libcufile) (#13210) @shwina
- Refactor pinned memory vector and ORC+Parquet writers (#13206) @ttnghia
- Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
- Optimization to decoding of parquet level streams (#13203) @nvdbaranec
- Clean up and simplify gpuDecideCompression (#13202) @vuule
- Use std::array for a statically sized vector in create_serialized_trie (#13201) @vuule
- Update minimum Python version to Python 3.9 (#13196) @shwina
- Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
- Remove usage of rapids-get-rapids-version-from-git (#13184) @jjacobelli
- Enable mixed-dtype decimal/scalar binary operations (#13171) @shwina
- Split up unique_count.cu to improve build time (#13169) @davidwendt
- Use nvtx3 includes in string examples. (#13165) @bdice
- Change some .cu gtest files to .cpp (#13155) @davidwendt
- Remove wheel pytest verbosity (#13151) @sevagh
- Fix libcudf to always pass null-count to set_null_mask (#13149) @davidwendt
- Fix gtests to always pass null-count to set_null_mask calls (#13148) @davidwendt
- Optimize JSON writer (#13144) @karthikeyann
- Performance improvement for libcudf upper/lower conversion for long strings (#13142) @davidwendt
- [REVIEW] Deprecate pad and backfill methods (#13140) @galipremsagar
- Use CTAD instead of functions in ProtobufReader (#13135) @vuule
- Remove more instances of UNKNOWN_NULL_COUNT (#13134) @vyasr
- Update clang-format to 16.0.1. (#13133) @bdice
- Add log messages about cuIO's nvCOMP and cuFile use (#13132) @vuule
- Branch 23.06 merge 23.04 (#13131) @vyasr
- Compute null-count in cudf::detail::slice (#13124) @davidwendt
- Use ARC V2 self-hosted runners for GPU jobs (#13123) @jjacobelli
- Set null-count in linked_column_view conversion operator (#13121) @davidwendt
- Adding ifdefs around nvcc-specific pragmas (#13110) @hyperbolic2346
- Add null-count parameter to json experimental parse_data utility (#13107) @davidwendt
- Remove uses-setup-env-vars (#13105) @vyasr
- Explicitly compute null count in concatenate APIs (#13104) @vyasr
- Replace unnecessary uses of UNKNOWN_NULL_COUNT (#13102) @vyasr
- Performance improvement for cudf::string_view::find functions (#13100) @davidwendt
- Use .element() instead of .data() for window range calculations (#13095) @mythrocks
- Cleanup Parquet chunked writer (#13094) @ttnghia
- Fix unused variable error/warning in page_data.cu (#13093) @davidwendt
- Cleanup ORC chunked writer (#13091) @ttnghia
- Remove using namespace cudf; from libcudf gtests source (#13089) @davidwendt
- Change cudf::test::make_null_mask to also return null-count (#13081) @davidwendt
- Resolved automerger from branch-23.04 to branch-23.06 (#13080) @galipremsagar
- Assert for non-empty nulls (#13071) @razajafri
- Remove deprecated regex functions from libcudf (#13067) @davidwendt
- Refactor cudf::detail::sorted_order (#13062) @ttnghia
- Improve performance of slice_strings for long strings (#13057) @davidwendt
- Reduce shared memory usage in gpuComputePageSizes by 50% (#13047) @nvdbaranec
- [REVIEW] Add notes to performance comparisons notebook (#13044) @galipremsagar
- Enable binary operations between scalars and columns of differing decimal types (#13034) @shwina
- Remove console output from some libcudf gtests (#13027) @davidwendt
- Remove underscore in build string. (#13025) @bdice
- Bump up JNI version 23.06.0-SNAPSHOT (#13021) @pxLi
- Fix auto merger from branch-23.04 to branch-23.06 (#13009) @galipremsagar
- Reduce peak memory use when writing compressed ORC files. (#12963) @vuule
- Add nvtx annotations to groupby methods (#12941) @wence-
- Compute column sizes in Parquet preprocess with single kernel (#12931) @SrikarVanavasam
- Add Python bindings for time zone data (TZiF) reader (#12826) @shwina
- Optimize set-like operations (#12769) @ttnghia
- [REVIEW] Upgrade to arrow-11 (#12757) @galipremsagar
- Add empty test files for test reorganization (#12288) @shwina
Published by raydouglass over 2 years ago
https://github.com/rapidsai/cudf - v23.04.00
🚨 Breaking Changes
- Pin dask and distributed for release (#13070) @galipremsagar
- Declare a different name for nan_equality.UNEQUAL to prevent Cython warnings. (#12947) @bdice
- Update minimum pandas and numpy pinnings (#12887) @galipremsagar
- Deprecate names & dtype in Index.copy (#12825) @galipremsagar
- Deprecate Index.is_* methods (#12820) @galipremsagar
- Deprecate datetime_is_numeric from describe (#12818) @galipremsagar
- Deprecate na_sentinel in factorize (#12817) @galipremsagar
- Make string methods return a Series with a useful Index (#12814) @shwina
- Produce useful guidance on overflow error in to_csv (#12705) @wence-
- Move strings_udf code into cuDF (#12669) @brandon-b-miller
- Remove cudf::strings::repeat_strings_output_sizes and optional parameter from cudf::strings::repeat_strings (#12609) @davidwendt
- Replace message parsing with throwing more specific exceptions (#12426) @vyasr
🐛 Bug Fixes
- Fix memcheck script to execute only _TEST files found in bin/gtests/libcudf (#13006) @davidwendt
- Fix DataFrame constructor to broadcast scalar inputs properly (#12997) @galipremsagar
- Drop force_nullable_schema from chunked parquet writer (#12996) @galipremsagar
- Fix gtest column utility comparator diff reporting (#12995) @davidwendt
- Handle index names while performing groupby (#12992) @galipremsagar
- Fix __setitem__ on string columns when the scalar value ends in a null byte (#12991) @wence-
- Fix sort_values when column is all empty strings (#12988) @eriknw
- Remove unused variable and fix memory issue in ORC writer (#12984) @ttnghia
- Pre-emptive fix for upstream dask.dataframe.read_parquet changes (#12983) @rjzamora
- Remove MANIFEST.in use auto-generated one for sdists and package_data for wheels (#12960) @vyasr
- Update to use rapids-export(COMPONENTS) feature. (#12959) @robertmaynard
- cudftestutil supports static gtest dependencies (#12957) @robertmaynard
- Include gtest in build environment. (#12956) @vyasr
- Correctly handle scalar indices in Index.__getitem__ (#12955) @wence-
- Avoid building cython twice (#12945) @galipremsagar
- Fix set index error for Series rolling window operations (#12942) @galipremsagar
- Fix calculation of null counts for Parquet statistics (#12938) @etseidl
- Preserve integer dtype of hive-partitioned column containing nulls (#12930) @rjzamora
- Use get_current_device_resource for intermediate allocations in COLLECT_LIST window code (#12927) @karthikeyann
- Mark dlpack tensor deleter as noexcept to match PyCapsule_Destructor signature. (#12921) @bdice
- Fix conda recipe post-link.sh typo (#12916) @pentschev
- min_rows and num_rows are swapped in ComputePageSizes declaration in Parquet reader (#12886) @etseidl
- Expect cupy to now support bool arrays for dlpack. (#12883) @vyasr
- Use python -m pytest for nightly wheel tests (#12871) @bdice
- Parquet writer column_size() should return a size_t (#12870) @etseidl
- Fix cudf::hash_partition kernel launch error with decimal128 types (#12863) @davidwendt
- Fix an issue with parquet chunked reader undercounting string lengths. (#12859) @nvdbaranec
- Remove tokenizers pre-install pinning. (#12854) @vyasr
- Fix parquet RangeIndex bug (#12838) @rjzamora
- Remove KAFKA_HOST_TEST from compute-sanitizer check (#12831) @davidwendt
- Make string methods return a Series with a useful Index (#12814) @shwina
- Tell cudf_kafka to use header-only fmt (#12796) @vyasr
- Add GroupBy.dtypes (#12783) @galipremsagar
- Fix a leak in a test and clarify some test names (#12781) @revans2
- Fix bug in all-null list due to join_list_elements special handling (#12767) @karthikeyann
- Add try/except for expected null-schema error in read_parquet (#12756) @rjzamora
- Throw an exception if an unsupported page encoding is detected in Parquet reader (#12754) @etseidl
- Fix a bug with num_keys in _scatter_by_slice (#12749) @thomcom
- Bump pinned rapids wheel deps to 23.4 (#12735) @sevagh
- Rework logic in cudf::strings::split_record to improve performance (#12729) @davidwendt
- Add always_nullable flag to Dremel encoding (#12727) @divyegala
- Fix memcheck read error in compound segmented reduce (#12722) @davidwendt
- Fix faulty conditional logic in JIT GroupBy.apply (#12706) @brandon-b-miller
- Produce useful guidance on overflow error in to_csv (#12705) @wence-
- Handle parquet list data corner case (#12698) @nvdbaranec
- Fix missing trailing comma in json writer (#12688) @karthikeyann
- Remove child from newCudaAsyncMemoryResource (#12681) @abellina
- Handle bool types in round API (#12670) @galipremsagar
- Ensure all of device bitmask is initialized in from_arrow (#12668) @wence-
- Fix from_arrow to load a sliced arrow table (#12665) @galipremsagar
- Fix dask-cudf read_parquet bug for multi-file aggregation (#12663) @rjzamora
- Fix AllocateLikeTest gtests reading uninitialized null-mask (#12643) @davidwendt
- Fix find_common_dtype and values to handle complex dtypes (#12537) @galipremsagar
- Fix fetching of MultiIndex values when a label is passed (#12521) @galipremsagar
- Fix Series comparison vs scalars (#12519) @brandon-b-miller
- Allow casting from UDFString back to StringView to call methods in strings_udf (#12363) @brandon-b-miller
📖 Documentation
- Fix GroupBy.apply doc examples rendering (#12994) @brandon-b-miller
- add sphinx building and s3 uploading for dask-cudf docs (#12982) @quasiben
- Add developer documentation forbidding default parameters in detail APIs (#12978) @vyasr
- Add README symlink for dask-cudf. (#12946) @bdice
- Remove return type from @return doxygen tags (#12908) @davidwendt
- Fix docs build to be pydata-sphinx-theme=0.13.0 compatible (#12874) @galipremsagar
- Add skeleton API and prose documentation for dask-cudf (#12725) @wence-
- Enable doctests for GroupBy methods (#12658) @brandon-b-miller
- Add comment about CUB patch for SegmentedSortInt.Bool gtest (#12611) @davidwendt
🚀 New Features
- Add JNI method for strings::replace multi variety (#12979) @NVnavkumar
- Add nunique aggregation support for cudf::segmented_reduce (#12972) @davidwendt
- Refactor orc chunked writer (#12949) @ttnghia
- Make Parquet writer nullable option application to single table writes (#12933) @vuule
- Refactor io::orc::ProtobufWriter (#12877) @ttnghia
- Make timezone table independent from ORC (#12805) @vuule
- Cache JIT GroupBy.apply functions (#12802) @brandon-b-miller
- Implement initial support for avro logical types (#6482) (#12788) @tpn
- Update tests/column_utilities to use experimental::equality row comparator (#12777) @divyegala
- Update distinct/unique_count to experimental::row hasher/comparator (#12776) @divyegala
- Update hash_partition to use experimental::row::row_hasher (#12761) @divyegala
- Update is_sorted to use experimental::row::lexicographic (#12752) @divyegala
- Update default data source in cuio reader benchmarks (#12740) @PointKernel
- Reenable stream identification library in CI (#12714) @vyasr
- Add regex_program strings splitting java APIs and tests (#12713) @cindyyuanjiang
- Add regex_program strings replacing java APIs and tests (#12701) @cindyyuanjiang
- Add regex_program strings extract java APIs and tests (#12699) @cindyyuanjiang
- Variable fragment sizes for Parquet writer (#12685) @etseidl
- Add segmented reduction support for fixed-point types (#12680) @davidwendt
- Move strings_udf code into cuDF (#12669) @brandon-b-miller
- Add regex_program searching APIs and related java classes (#12666) @cindyyuanjiang
- Add logging to libcudf (#12637) @vuule
- Add compound aggregations to cudf::segmented_reduce (#12573) @davidwendt
- Convert rank to use experimental row comparators (#12481) @divyegala
- Use rapids-cmake parallel testing feature (#12451) @robertmaynard
- Enable detection of undesired stream usage (#12089) @vyasr
🛠️ Improvements
- Pin dask and distributed for release (#13070) @galipremsagar
- Pin cupy in wheel tests to supported versions (#13041) @vyasr
- Pin numba version (#13001) @vyasr
- Rework gtests SequenceTest to remove using namespace cudf (#12985) @davidwendt
- Stop setting package version attribute in wheels (#12977) @vyasr
- Move detail reduction functions to cudf::reduction::detail namespace (#12971) @davidwendt
- Remove default detail mrs: part7 (#12970) @vyasr
- Remove default detail mrs: part6 (#12969) @vyasr
- Remove default detail mrs: part5 (#12968) @vyasr
- Remove default detail mrs: part4 (#12967) @vyasr
- Remove default detail mrs: part3 (#12966) @vyasr
- Remove default detail mrs: part2 (#12965) @vyasr
- Remove default detail mrs: part1 (#12964) @vyasr
- Add force_nullable_schema parameter to Parquet writer. (#12952) @galipremsagar
- Declare a different name for nan_equality.UNEQUAL to prevent Cython warnings. (#12947) @bdice
- Remove remaining default stream parameters (#12943) @vyasr
- Fix cudf::segmented_reduce gtest for ANY aggregation (#12940) @davidwendt
- Implement groupby.head and groupby.tail (#12939) @wence-
- Fix libcudf gtests to pass null-count=0 for empty validity masks (#12923) @davidwendt
- Migrate parquet encoding to use experimental row operators (#12918) @PointKernel
- Fix benchmarks coded in namespace cudf and using namespace cudf (#12915) @karthikeyann
- Fix io/text gtests coded in namespace cudf::test (#12914) @karthikeyann
- Pass SCCACHE_S3_USE_SSL to conda builds (#12910) @ajschmidt8
- Fix FST, JSON gtests & benchmarks coded in namespace cudf::test (#12907) @karthikeyann
- Generate pyproject dependencies using dfg (#12906) @vyasr
- Update libcudf counting functions to specify cudf::size_type (#12904) @davidwendt
- Fix moto env vars & pass AWS_SESSION_TOKEN to conda builds (#12902) @ajschmidt8
- Rewrite CSV writer benchmark with nvbench (#12901) @PointKernel
- Rework some code logic to reduce iterator and comparator inlining to improve compile time (#12900) @davidwendt
- Deprecate line_terminator in favor of lineterminator in to_csv (#12896) @wence-
- Add stream and mr parameters for structs::detail::flatten_nested_columns (#12892) @ttnghia
- Deprecate libcudf regex APIs accepting pattern strings directly (#12891) @davidwendt
- Remove default parameters from detail headers in include (#12888) @vyasr
- Update minimum pandas and numpy pinnings (#12887) @galipremsagar
- Implement groupby.sample (#12882) @wence-
- Update JNI build ENV default to gcc 11 (#12881) @pxLi
- Change return type of cudf::structs::detail::flatten_nested_columns to smart pointer (#12878) @ttnghia
- Fix passing seed parameter to MurmurHash3_32 in cudf::hash() function (#12875) @davidwendt
- Remove manual artifact upload step in CI (#12869) @ajschmidt8
- Update to GCC 11 (#12868) @bdice
- Fix null hive-partition behavior in dask-cudf parquet (#12866) @rjzamora
- Update to protobuf>=4.21.6,<4.22. (#12864) @bdice
- Update RMM allocators (#12861) @pentschev
- Improve performance for replace-multi for long strings (#12858) @davidwendt
- Drop Python 3.7 handling for pickle protocol 4 (#12857) @jakirkham
- Migrate as much as possible to pyproject.toml (#12850) @vyasr
- Enable nbqa pre-commit hooks for isort and black. (#12848) @bdice
- Setting a threshold for KvikIO IO (#12841) @madsbk
- Update datasets download URL (#12840) @jjacobelli
- Make docs builds less verbose (#12836) @AyodeAwe
- Consolidate linter configs into pyproject.toml (#12834) @vyasr
- Deprecate names & dtype in Index.copy (#12825) @galipremsagar
- Deprecate inplace parameters in categorical methods (#12824) @galipremsagar
- Add optional text file support to ninja-log utility (#12823) @davidwendt
- Deprecate Index.is_* methods (#12820) @galipremsagar
- Add dfg as a pre-commit hook (#12819) @vyasr
- Deprecate datetime_is_numeric from describe (#12818) @galipremsagar
- Deprecate na_sentinel in factorize (#12817) @galipremsagar
- Shuffling read into a sub function in parquet read (#12809) @hyperbolic2346
- Fixing parquet coalescing of reads (#12808) @hyperbolic2346
- CI: Remove specification of manual stage for check_style.sh script. (#12803) @csadorf
- Add compute-sanitizer github workflow action to nightly tests (#12800) @davidwendt
- Enable groupby std and variance aggregation types in libcudf Debug build (#12799) @davidwendt
- Expose seed argument to hash_values (#12795) @ayushdg
- Fix groupby gtests coded in namespace cudf::test (#12784) @davidwendt
- Improve performance for cudf::strings::count_characters for long strings (#12779) @davidwendt
- Deallocate encoded data in ORC writer immediately after compression (#12770) @vuule
- Stop force pulling fmt in nvbench. (#12768) @vyasr
- Remove now redundant cuda initialization (#12758) @vyasr
- Adds JSON reader, writer io benchmark (#12753) @karthikeyann
- Use test paths relative to package directory. (#12751) @bdice
- Add build metrics report as artifact to cpp-build workflow (#12750) @davidwendt
- Add JNI methods for detecting and purging non-empty nulls from LIST and STRUCT (#12742) @razajafri
- Stop using versioneer to manage versions (#12741) @vyasr
- Reduce error handling verbosity in CI tests scripts (#12738) @AjayThorve
- Reduce the number of test cases in multibyte_split benchmark (#12737) @PointKernel
- Update shared workflow branches (#12733) @ajschmidt8
- JNI switches to nested JSON reader (#12732) @res-life
- Changing cudf::io::source_info to use cudf::host_span<std::byte> in a non-breaking form (#12730) @hyperbolic2346
- Add nvbench environment class for initializing RMM in benchmarks (#12728) @davidwendt
- Split C++ and Python build dependencies into separate lists. (#12724) @bdice
- Add build dependencies to Java tests. (#12723) @bdice
- Allow setting the seed argument for hash partition (#12715) @firestarman
- Remove gpuCI scripts. (#12712) @bdice
- Unpin `dask` and `distributed` for development (#12710) @galipremsagar
- `partition_by_hash()`: use `_split()` (#12704) @madsbk
- Remove DataFrame.quantiles from docs. (#12684) @bdice
- Fast path for `experimental::row::equality` (#12676) @divyegala
- Move date to build string in `conda` recipe (#12661) @ajschmidt8
- Refactor reduction logic for fixed-point types (#12652) @davidwendt
- Pay off some JNI RMM API tech debt (#12632) @revans2
- Merge `copy-on-write` feature branch into `branch-23.04` (#12619) @galipremsagar
- Remove cudf::strings::repeat_strings_output_sizes and optional parameter from cudf::strings::repeat_strings (#12609) @davidwendt
- Pin cuda-nvrtc. (#12606) @bdice
- Remove cudf::test::print calls from libcudf gtests (#12604) @davidwendt
- Init JNI version 23.04.0-SNAPSHOT (#12599) @pxLi
- Add performance benchmarks to user facing docs (#12595) @galipremsagar
- Add docs build job (#12592) @AyodeAwe
- Replace message parsing with throwing more specific exceptions (#12426) @vyasr
- Support conversion to/from cudf in dask.dataframe.core.to_backend (#12380) @rjzamora
Published by raydouglass almost 3 years ago
https://github.com/rapidsai/cudf - [NIGHTLY] v23.02.00
🚨 Breaking Changes
- Pin `dask` and `distributed` for release (#12695) @galipremsagar
- Change ways to access `ptr` in `Buffer` (#12587) @galipremsagar
- Remove column names (#12578) @vuule
- Default `cudf::io::read_json` to nested JSON parser (#12544) @vuule
- Switch `engine=cudf` to the new `JSON` reader (#12509) @galipremsagar
- Add trailing comma support for nested JSON reader (#12448) @karthikeyann
- Upgrade to `arrow-10.0.1` (#12327) @galipremsagar
- Fail loudly to avoid data corruption with unsupported input in `read_orc` (#12325) @vuule
- CSV, JSON reader to infer integer column with nulls as int64 instead of float64 (#12309) @karthikeyann
- Remove deprecated code for 23.02 (#12281) @vyasr
- Null element for parsing error in numeric types in JSON, CSV reader (#12272) @karthikeyann
- Purge non-empty nulls for `superimpose_nulls` and `push_down_nulls` (#12239) @ttnghia
- Rename `cudf::structs::detail::superimpose_parent_nulls` APIs (#12230) @ttnghia
- Remove JIT type names, refactor id_to_type. (#12158) @bdice
- Floor division uses integer division for integral arguments (#12131) @wence-
🐛 Bug Fixes
- Fix update-version.sh (#12745) @raydouglass
- Fix a mask data corruption in UDF (#12647) @galipremsagar
- pre-commit: Update isort version to 5.12.0 (#12645) @wence-
- tests: Skip cuInit tests if cuda-gdb is not found or not working (#12644) @wence-
- Revert regex program java APIs and tests (#12639) @cindyyuanjiang
- Fix leaks in ColumnVectorTest (#12625) @jlowe
- Handle when spillable buffers own each other (#12607) @madsbk
- Fix incorrect null counts for sliced columns in JCudfSerialization (#12589) @jlowe
- lists: Transfer dtypes correctly through list.get (#12586) @wence-
- timedelta: Don't go via float intermediates for floordiv (#12585) @wence-
- Fixing BUG, `get_next_chunk()` should use the blocking function `device_read()` (#12584) @madsbk
- Make JNI QuoteStyle accessible outside ai.rapids.cudf (#12572) @mythrocks
- `partition_by_hash()`: support index (#12554) @madsbk
- Mixed Join benchmark bug due to wrong conditional column (#12553) @divyegala
- Update List Lexicographical Comparator (#12538) @divyegala
- Dynamically read PTX version (#12534) @brandon-b-miller
- build.sh switch to use `RAPIDS` magic value (#12525) @robertmaynard
- Loosen runtime arrow pinning (#12522) @vyasr
- Enable metadata transfer for complex types in transpose (#12491) @galipremsagar
- Fix issues with parquet chunked reader (#12488) @nvdbaranec
- Fix missing metadata transfer in concat for `ListColumn` (#12487) @galipremsagar
- Rename libcudf substring source files to slice (#12484) @davidwendt
- Fix compile issue with arrow 10 (#12465) @ttnghia
- Fix List offsets bug in mixed type list column in nested JSON reader (#12447) @karthikeyann
- Fix xfail incompatibilities (#12423) @vyasr
- Fix bug in Parquet column index encoding (#12404) @etseidl
- When building Arrow shared look for a shared OpenSSL (#12396) @robertmaynard
- Fix get_json_object to return empty column on empty input (#12384) @davidwendt
- Pin arrow 9 in testing dependencies to prevent conda solve issues (#12377) @vyasr
- Fix reductions any/all return value for empty input (#12374) @davidwendt
- Fix debug compile errors in parquet.hpp (#12372) @davidwendt
- Purge non-empty nulls in `cudf::make_lists_column` (#12370) @ttnghia
- Use correct memory resource in io::make_column (#12364) @vyasr
- Add code to detect possible malformed page data in parquet files. (#12360) @nvdbaranec
- Fail loudly to avoid data corruption with unsupported input in `read_orc` (#12325) @vuule
- Fix NumericPairIteratorTest for float values (#12306) @davidwendt
- Fixes memory allocation in nested JSON tokenizer (#12300) @elstehle
- Reconstruct dtypes correctly for list aggs of struct columns (#12290) @wence-
- Fix regex \A and \Z to strictly match string begin/end (#12282) @davidwendt
- Fix compile issue in `json_chunked_reader.cpp` (#12280) @ttnghia
- Change reductions any/all to return valid values for empty input (#12279) @davidwendt
- Only exclude join keys that are indices from key columns (#12271) @wence-
- Fix spill to device limit (#12252) @madsbk
- Correct behaviour of sort in `concat` for singleton concatenations (#12247) @wence-
- Purge non-empty nulls for `superimpose_nulls` and `push_down_nulls` (#12239) @ttnghia
- Patch CUB DeviceSegmentedSort and remove workaround (#12234) @davidwendt
- Fix memory leak in udf_string::assign(&&) function (#12206) @davidwendt
- Workaround thrust-copy-if limit in json get_tree_representation (#12190) @davidwendt
- Fix page size calculation in Parquet writer (#12182) @etseidl
- Add cudf::detail::sizes_to_offsets_iterator to allow checking overflow in offsets (#12180) @davidwendt
- Workaround thrust-copy-if limit in wordpiece-tokenizer (#12168) @davidwendt
- Floor division uses integer division for integral arguments (#12131) @wence-
📖 Documentation
- Fix link to NVTX (#12598) @sameerz
- Include missing groupby functions in documentation (#12580) @quasiben
- Fix documentation author (#12527) @bdice
- Update libcudf reduction docs for casting output types (#12526) @davidwendt
- Add JSON reader page in user guide (#12499) @GregoryKimball
- Link unsupported iteration API docstrings (#12482) @galipremsagar
- `strings_udf` doc update (#12469) @brandon-b-miller
- Update cudf_assert docs with correct NDEBUG behavior (#12464) @robertmaynard
- Update pre-commit hooks guide (#12395) @bdice
- Update test docs to not use detail comparison utilities (#12332) @PointKernel
- Fix doxygen description for regex_program::compute_working_memory_size (#12329) @davidwendt
- Add eval to docs. (#12322) @vyasr
- Turn on xfail_strict=true (#12244) @wence-
- Update 10 minutes to cuDF (#12114) @wence-
🚀 New Features
- Use kvikIO as the default IO backend (#12574) @vuule
- Use `has_nonempty_nulls` instead of `may_contain_non_empty_nulls` in `superimpose_nulls` and `push_down_nulls` (#12560) @ttnghia
- Add strings methods removeprefix and removesuffix (#12557) @davidwendt
- Add `regex_program` java APIs and unit tests (#12548) @cindyyuanjiang
- Default `cudf::io::read_json` to nested JSON parser (#12544) @vuule
- Make string quoting optional on CSV write (#12539) @mythrocks
- Use new nvCOMP API to optimize the compression temp memory size (#12533) @vuule
- Support "values" orient (array of arrays) in Nested JSON reader (#12498) @karthikeyann
- `one_hot_encode` to use experimental row comparators (#12478) @divyegala
- Support %W and %w format specifiers in cudf::strings::to_timestamps (#12475) @davidwendt
- Add JSON Writer (#12474) @karthikeyann
- Refactor `thrust_copy_if` into `cudf::detail::copy_if_safe` (#12455) @ttnghia
- Add trailing comma support for nested JSON reader (#12448) @karthikeyann
- Extract `tokenize_json.hpp` detail header from `src/io/json/nested_json.hpp` (#12432) @ttnghia
- JNI bindings to write CSV (#12425) @mythrocks
- Nested JSON depth benchmark (#12371) @karthikeyann
- Implement `lists::reverse` (#12336) @ttnghia
- Use `device_read` in experimental `read_json` (#12314) @vuule
- Implement JNI for `strings::reverse` (#12283) @ttnghia
- Null element for parsing error in numeric types in JSON, CSV reader (#12272) @karthikeyann
- Add cudf::strings::like function with multiple patterns (#12269) @davidwendt
- Add environment variable to control host memory allocation in `hostdevice_vector` (#12251) @vuule
- Add cudf::strings::reverse function (#12227) @davidwendt
- Selectively use dictionary encoding in Parquet writer (#12211) @etseidl
- Support `replace` in `strings_udf` (#12207) @brandon-b-miller
- Add support to read binary encoded decimals in parquet (#12205) @PointKernel
- Support regex EOL where the string ends with a new-line character (#12181) @davidwendt
- Updating `stream_compaction/unique` to use new row comparators (#12159) @divyegala
- Add device buffer datasource (#12024) @PointKernel
- Implement groupby apply with JIT (#11452) @bwyogatama
🛠️ Improvements
- Update shared workflow branches (#12696) @ajschmidt8
- Pin `dask` and `distributed` for release (#12695) @galipremsagar
- Don't upload `libcudf-example` to Anaconda.org (#12671) @ajschmidt8
- Pin wheel dependencies to same RAPIDS release (#12659) @sevagh
- Use CTK 118/cp310 branch of wheel workflows (#12602) @sevagh
- Change ways to access `ptr` in `Buffer` (#12587) @galipremsagar
- Version a parquet writer xfail (#12579) @galipremsagar
- Remove column names (#12578) @vuule
- Parquet reader optimization to address V100 regression. (#12577) @nvdbaranec
- Add support for `category` dtypes in CSV reader (#12571) @galipremsagar
- Remove `spill_lock` parameter from `SpillableBuffer.get_ptr()` (#12564) @madsbk
- Optimize `cudf::make_lists_column` (#12547) @ttnghia
- Remove `cudf::strings::repeat_strings_output_sizes` from Java and JNI (#12546) @ttnghia
- Test that cuInit is not called when RAPIDS_NO_INITIALIZE is set (#12545) @wence-
- Rework repeat_strings to use sizes-to-offsets utility (#12543) @davidwendt
- Replace exclusive_scan with sizes_to_offsets in cudf::lists::sequences (#12541) @davidwendt
- Rework nvtext::ngrams_tokenize to use sizes-to-offsets utility (#12540) @davidwendt
- Fix binary-ops gtests coded in namespace cudf::test (#12536) @davidwendt
- More `@acquire_spill_lock()` and `as_buffer(..., exposed=False)` (#12535) @madsbk
- Guard CUDA runtime APIs with error checking (#12531) @PointKernel
- Update TODOs from issue 10432. (#12528) @bdice
- Update rapids-cmake definitions version in GitHub Actions style checks. (#12511) @bdice
- Switch `engine=cudf` to the new `JSON` reader (#12509) @galipremsagar
- Fix SUM/MEAN aggregation type support. (#12503) @bdice
- Stop using pandas._testing (#12492) @vyasr
- Fix ROLLING_TEST gtests coded in namespace cudf::test (#12490) @davidwendt
- Fix erroneously skipped ORC ZSTD test (#12486) @vuule
- Rework nvtext::generate_character_ngrams to use make_strings_children (#12480) @davidwendt
- Raise warnings as errors in the test suite (#12468) @vyasr
- Remove `int32` hard-coding in python (#12467) @galipremsagar
- Use cudaMemcpyDefault. (#12466) @bdice
- Update workflows for nightly tests (#12462) @ajschmidt8
- Build CUDA `11.8` and Python `3.10` Packages (#12457) @ajschmidt8
- JNI build image default as cuda11.8 (#12441) @pxLi
- Re-enable `Recently Updated` Check (#12435) @ajschmidt8
- Rework remaining cudf::strings::from_xyz functions to use make_strings_children (#12434) @vuule
- Build wheels alongside conda CI (#12427) @sevagh
- Remove arguments for checking exception messages in Python (#12424) @vyasr
- Clean up cuco usage (#12421) @PointKernel
- Fix warnings in remaining modules (#12406) @vyasr
- Update `ops-bot.yaml` (#12402) @ajschmidt8
- Rework cudf::strings::integers_to_ipv4 to use make_strings_children utility (#12401) @davidwendt
- Use `numpy.empty()` instead of `bytearray` to allocate host memory for spilling (#12399) @madsbk
- Deprecate chunksize from dask_cudf.read_csv (#12394) @rjzamora
- Expose the RMM pool size in JNI (#12390) @revans2
- Fix COPYING_TEST: gtests coded in namespace cudf::test (#12387) @davidwendt
- Rework cudf::strings::url_encode to use make_strings_children utility (#12385) @davidwendt
- Use make_strings_children in parse_data nested json reader (#12382) @karthikeyann
- Fix warnings in test_datetime.py (#12381) @vyasr
- Mixed Join Benchmarks (#12375) @divyegala
- Fix warnings in dataframe.py (#12369) @vyasr
- Update conda recipes. (#12368) @bdice
- Use gpu-latest-1 runner tag (#12366) @bdice
- Rework cudf::strings::from_booleans to use make_strings_children (#12365) @vuule
- Fix warnings in test modules up to test_dataframe.py (#12355) @vyasr
- JSON column performance optimization - struct column nulls (#12354) @karthikeyann
- Accelerate stable-segmented-sort with CUB segmented sort (#12347) @davidwendt
- Add size check to make_offsets_child_column utility (#12345) @davidwendt
- Enable max compression ratio small block optimization for ZSTD (#12338) @vuule
- Fix warnings in test_monotonic.py (#12334) @vyasr
- Improve JSON column creation performance (list offsets) (#12330) @karthikeyann
- Upgrade to `arrow-10.0.1` (#12327) @galipremsagar
- Fix warnings in test_orc.py (#12326) @vyasr
- Fix warnings in test_groupby.py (#12324) @vyasr
- Fix `test_notebooks.sh` (#12323) @ajschmidt8
- Fix transform gtests coded in namespace cudf::test (#12321) @davidwendt
- Fix `check_style.sh` script (#12320) @ajschmidt8
- Rework cudf::strings::from_timestamps to use make_strings_children (#12317) @davidwendt
- Fix warnings in test_index.py (#12313) @vyasr
- Fix warnings in test_multiindex.py (#12310) @vyasr
- CSV, JSON reader to infer integer column with nulls as int64 instead of float64 (#12309) @karthikeyann
- Fix warnings in test_indexing.py (#12305) @vyasr
- Fix warnings in test_joining.py (#12304) @vyasr
- Unpin `dask` and `distributed` for development (#12302) @galipremsagar
- Re-enable `sccache` for Jenkins builds (#12297) @ajschmidt8
- Define needs for pr-builder workflow. (#12296) @bdice
- Forward merge 22.12 into 23.02 (#12294) @vyasr
- Fix warnings in test_stats.py (#12293) @vyasr
- Fix table gtests coded in namespace cudf::test (#12292) @davidwendt
- Change cython for regex calls to use cudf::strings::regex_program (#12289) @davidwendt
- Improved error reporting when reading multiple JSON files (#12285) @vuule
- Deprecate Frame.sum_of_squares (#12284) @vyasr
- Remove deprecated code for 23.02 (#12281) @vyasr
- Clean up handling of max_page_size_bytes in Parquet writer (#12277) @etseidl
- Fix replace gtests coded in namespace cudf::test (#12270) @davidwendt
- Add pandas nullable type support in `Index.to_pandas` (#12268) @galipremsagar
- Rework nvtext::detokenize to use indexalator for row indices (#12267) @davidwendt
- Fix reduction gtests coded in namespace cudf::test (#12257) @davidwendt
- Remove default parameters from cudf::detail::sort function declarations (#12254) @davidwendt
- Add `duplicated` support for `Series`, `DataFrame` and `Index` (#12246) @galipremsagar
- Replace column/table test utilities with macros (#12242) @PointKernel
- Rework cudf::strings::pad and zfill to use make_strings_children (#12238) @davidwendt
- Fix sort gtests coded in namespace cudf::test (#12237) @davidwendt
- Wrapping concat and file writes in `@acquire_spill_lock()` (#12232) @madsbk
- Rename `cudf::structs::detail::superimpose_parent_nulls` APIs (#12230) @ttnghia
- Cover parsing to decimal types in `read_json` tests (#12229) @vuule
- Spill Statistics (#12223) @madsbk
- Use CUDF_JNI_ENABLE_PROFILING to conditionally enable profiling support. (#12221) @bdice
- Clean up of `test_spilling.py` (#12220) @madsbk
- Simplify repetitive boolean logic (#12218) @vuule
- Add `Series.hasnans` and `Index.hasnans` (#12214) @galipremsagar
- Add cudf::strings::udf::replace function (#12210) @davidwendt
- Adds in new java APIs for appending byte arrays to host columnar data (#12208) @revans2
- Remove Python dependencies from Java CI. (#12193) @bdice
- Fix null order in sort-based groupby and improve groupby tests (#12191) @divyegala
- Move strings children functions from cudf/strings/detail/utilities.cuh to new header (#12185) @davidwendt
- Clean up existing JNI scalar to column code (#12173) @revans2
- Remove JIT type names, refactor id_to_type. (#12158) @bdice
- Update JNI version to 23.02.0-SNAPSHOT (#12129) @pxLi
- Minor refactor of cpp/src/io/parquet/page_data.cu (#12126) @etseidl
- Add codespell as a linter (#12097) @benfred
- Enable specifying exceptions in error macros (#12078) @vyasr
- Move `_label_encoding` from Series to Column (#12040) @shwina
- Add GitHub Actions Workflows (#12002) @ajschmidt8
- Consolidate dask-cudf `groupby_agg` calls in one place (#10835) @charlesbluca
Published by rapids-bot[bot] about 3 years ago
https://github.com/rapidsai/cudf - v23.02.00
🚨 Breaking Changes
- Pin `dask` and `distributed` for release (#12695) @galipremsagar
- Change ways to access `ptr` in `Buffer` (#12587) @galipremsagar
- Remove column names (#12578) @vuule
- Default `cudf::io::read_json` to nested JSON parser (#12544) @vuule
- Switch `engine=cudf` to the new `JSON` reader (#12509) @galipremsagar
- Add trailing comma support for nested JSON reader (#12448) @karthikeyann
- Upgrade to `arrow-10.0.1` (#12327) @galipremsagar
- Fail loudly to avoid data corruption with unsupported input in `read_orc` (#12325) @vuule
- CSV, JSON reader to infer integer column with nulls as int64 instead of float64 (#12309) @karthikeyann
- Remove deprecated code for 23.02 (#12281) @vyasr
- Null element for parsing error in numeric types in JSON, CSV reader (#12272) @karthikeyann
- Purge non-empty nulls for `superimpose_nulls` and `push_down_nulls` (#12239) @ttnghia
- Rename `cudf::structs::detail::superimpose_parent_nulls` APIs (#12230) @ttnghia
- Remove JIT type names, refactor id_to_type. (#12158) @bdice
- Floor division uses integer division for integral arguments (#12131) @wence-
🐛 Bug Fixes
- Fix a mask data corruption in UDF (#12647) @galipremsagar
- pre-commit: Update isort version to 5.12.0 (#12645) @wence-
- tests: Skip cuInit tests if cuda-gdb is not found or not working (#12644) @wence-
- Revert regex program java APIs and tests (#12639) @cindyyuanjiang
- Fix leaks in ColumnVectorTest (#12625) @jlowe
- Handle when spillable buffers own each other (#12607) @madsbk
- Fix incorrect null counts for sliced columns in JCudfSerialization (#12589) @jlowe
- lists: Transfer dtypes correctly through list.get (#12586) @wence-
- timedelta: Don't go via float intermediates for floordiv (#12585) @wence-
- Fixing BUG, `get_next_chunk()` should use the blocking function `device_read()` (#12584) @madsbk
- Make JNI QuoteStyle accessible outside ai.rapids.cudf (#12572) @mythrocks
- `partition_by_hash()`: support index (#12554) @madsbk
- Mixed Join benchmark bug due to wrong conditional column (#12553) @divyegala
- Update List Lexicographical Comparator (#12538) @divyegala
- Dynamically read PTX version (#12534) @brandon-b-miller
- build.sh switch to use `RAPIDS` magic value (#12525) @robertmaynard
- Loosen runtime arrow pinning (#12522) @vyasr
- Enable metadata transfer for complex types in transpose (#12491) @galipremsagar
- Fix issues with parquet chunked reader (#12488) @nvdbaranec
- Fix missing metadata transfer in concat for `ListColumn` (#12487) @galipremsagar
- Rename libcudf substring source files to slice (#12484) @davidwendt
- Fix compile issue with arrow 10 (#12465) @ttnghia
- Fix List offsets bug in mixed type list column in nested JSON reader (#12447) @karthikeyann
- Fix xfail incompatibilities (#12423) @vyasr
- Fix bug in Parquet column index encoding (#12404) @etseidl
- When building Arrow shared look for a shared OpenSSL (#12396) @robertmaynard
- Fix get_json_object to return empty column on empty input (#12384) @davidwendt
- Pin arrow 9 in testing dependencies to prevent conda solve issues (#12377) @vyasr
- Fix reductions any/all return value for empty input (#12374) @davidwendt
- Fix debug compile errors in parquet.hpp (#12372) @davidwendt
- Purge non-empty nulls in `cudf::make_lists_column` (#12370) @ttnghia
- Use correct memory resource in io::make_column (#12364) @vyasr
- Add code to detect possible malformed page data in parquet files. (#12360) @nvdbaranec
- Fail loudly to avoid data corruption with unsupported input in `read_orc` (#12325) @vuule
- Fix NumericPairIteratorTest for float values (#12306) @davidwendt
- Fixes memory allocation in nested JSON tokenizer (#12300) @elstehle
- Reconstruct dtypes correctly for list aggs of struct columns (#12290) @wence-
- Fix regex \A and \Z to strictly match string begin/end (#12282) @davidwendt
- Fix compile issue in `json_chunked_reader.cpp` (#12280) @ttnghia
- Change reductions any/all to return valid values for empty input (#12279) @davidwendt
- Only exclude join keys that are indices from key columns (#12271) @wence-
- Fix spill to device limit (#12252) @madsbk
- Correct behaviour of sort in `concat` for singleton concatenations (#12247) @wence-
- Purge non-empty nulls for `superimpose_nulls` and `push_down_nulls` (#12239) @ttnghia
- Patch CUB DeviceSegmentedSort and remove workaround (#12234) @davidwendt
- Fix memory leak in udf_string::assign(&&) function (#12206) @davidwendt
- Workaround thrust-copy-if limit in json get_tree_representation (#12190) @davidwendt
- Fix page size calculation in Parquet writer (#12182) @etseidl
- Add cudf::detail::sizes_to_offsets_iterator to allow checking overflow in offsets (#12180) @davidwendt
- Workaround thrust-copy-if limit in wordpiece-tokenizer (#12168) @davidwendt
- Floor division uses integer division for integral arguments (#12131) @wence-
📖 Documentation
- Fix link to NVTX (#12598) @sameerz
- Include missing groupby functions in documentation (#12580) @quasiben
- Fix documentation author (#12527) @bdice
- Update libcudf reduction docs for casting output types (#12526) @davidwendt
- Add JSON reader page in user guide (#12499) @GregoryKimball
- Link unsupported iteration API docstrings (#12482) @galipremsagar
- `strings_udf` doc update (#12469) @brandon-b-miller
- Update cudf_assert docs with correct NDEBUG behavior (#12464) @robertmaynard
- Update pre-commit hooks guide (#12395) @bdice
- Update test docs to not use detail comparison utilities (#12332) @PointKernel
- Fix doxygen description for regex_program::compute_working_memory_size (#12329) @davidwendt
- Add eval to docs. (#12322) @vyasr
- Turn on xfail_strict=true (#12244) @wence-
- Update 10 minutes to cuDF (#12114) @wence-
🚀 New Features
- Use kvikIO as the default IO backend (#12574) @vuule
- Use `has_nonempty_nulls` instead of `may_contain_non_empty_nulls` in `superimpose_nulls` and `push_down_nulls` (#12560) @ttnghia
- Add strings methods removeprefix and removesuffix (#12557) @davidwendt
- Add `regex_program` java APIs and unit tests (#12548) @cindyyuanjiang
- Default `cudf::io::read_json` to nested JSON parser (#12544) @vuule
- Make string quoting optional on CSV write (#12539) @mythrocks
- Use new nvCOMP API to optimize the compression temp memory size (#12533) @vuule
- Support "values" orient (array of arrays) in Nested JSON reader (#12498) @karthikeyann
- `one_hot_encode` to use experimental row comparators (#12478) @divyegala
- Support %W and %w format specifiers in cudf::strings::to_timestamps (#12475) @davidwendt
- Add JSON Writer (#12474) @karthikeyann
- Refactor `thrust_copy_if` into `cudf::detail::copy_if_safe` (#12455) @ttnghia
- Add trailing comma support for nested JSON reader (#12448) @karthikeyann
- Extract `tokenize_json.hpp` detail header from `src/io/json/nested_json.hpp` (#12432) @ttnghia
- JNI bindings to write CSV (#12425) @mythrocks
- Nested JSON depth benchmark (#12371) @karthikeyann
- Implement `lists::reverse` (#12336) @ttnghia
- Use `device_read` in experimental `read_json` (#12314) @vuule
- Implement JNI for `strings::reverse` (#12283) @ttnghia
- Null element for parsing error in numeric types in JSON, CSV reader (#12272) @karthikeyann
- Add cudf::strings::like function with multiple patterns (#12269) @davidwendt
- Add environment variable to control host memory allocation in `hostdevice_vector` (#12251) @vuule
- Add cudf::strings::reverse function (#12227) @davidwendt
- Selectively use dictionary encoding in Parquet writer (#12211) @etseidl
- Support `replace` in `strings_udf` (#12207) @brandon-b-miller
- Add support to read binary encoded decimals in parquet (#12205) @PointKernel
- Support regex EOL where the string ends with a new-line character (#12181) @davidwendt
- Updating `stream_compaction/unique` to use new row comparators (#12159) @divyegala
- Add device buffer datasource (#12024) @PointKernel
- Implement groupby apply with JIT (#11452) @bwyogatama
🛠️ Improvements
- Update shared workflow branches (#12696) @ajschmidt8
- Pin `dask` and `distributed` for release (#12695) @galipremsagar
- Don't upload `libcudf-example` to Anaconda.org (#12671) @ajschmidt8
- Pin wheel dependencies to same RAPIDS release (#12659) @sevagh
- Use CTK 118/cp310 branch of wheel workflows (#12602) @sevagh
- Change ways to access `ptr` in `Buffer` (#12587) @galipremsagar
- Version a parquet writer xfail (#12579) @galipremsagar
- Remove column names (#12578) @vuule
- Parquet reader optimization to address V100 regression. (#12577) @nvdbaranec
- Add support for `category` dtypes in CSV reader (#12571) @galipremsagar
- Remove `spill_lock` parameter from `SpillableBuffer.get_ptr()` (#12564) @madsbk
- Optimize `cudf::make_lists_column` (#12547) @ttnghia
- Remove `cudf::strings::repeat_strings_output_sizes` from Java and JNI (#12546) @ttnghia
- Test that cuInit is not called when RAPIDS_NO_INITIALIZE is set (#12545) @wence-
- Rework repeat_strings to use sizes-to-offsets utility (#12543) @davidwendt
- Replace exclusive_scan with sizes_to_offsets in cudf::lists::sequences (#12541) @davidwendt
- Rework nvtext::ngrams_tokenize to use sizes-to-offsets utility (#12540) @davidwendt
- Fix binary-ops gtests coded in namespace cudf::test (#12536) @davidwendt
- More `@acquire_spill_lock()` and `as_buffer(..., exposed=False)` (#12535) @madsbk
- Guard CUDA runtime APIs with error checking (#12531) @PointKernel
- Update TODOs from issue 10432. (#12528) @bdice
- Update rapids-cmake definitions version in GitHub Actions style checks. (#12511) @bdice
- Switch `engine=cudf` to the new `JSON` reader (#12509) @galipremsagar
- Fix SUM/MEAN aggregation type support. (#12503) @bdice
- Stop using pandas._testing (#12492) @vyasr
- Fix ROLLING_TEST gtests coded in namespace cudf::test (#12490) @davidwendt
- Fix erroneously skipped ORC ZSTD test (#12486) @vuule
- Rework nvtext::generate_character_ngrams to use make_strings_children (#12480) @davidwendt
- Raise warnings as errors in the test suite (#12468) @vyasr
- Remove `int32` hard-coding in python (#12467) @galipremsagar
- Use cudaMemcpyDefault. (#12466) @bdice
- Update workflows for nightly tests (#12462) @ajschmidt8
- Build CUDA `11.8` and Python `3.10` Packages (#12457) @ajschmidt8
- JNI build image default as cuda11.8 (#12441) @pxLi
- Re-enable `Recently Updated` Check (#12435) @ajschmidt8
- Rework remaining cudf::strings::from_xyz functions to use make_strings_children (#12434) @vuule
- Build wheels alongside conda CI (#12427) @sevagh
- Remove arguments for checking exception messages in Python (#12424) @vyasr
- Clean up cuco usage (#12421) @PointKernel
- Fix warnings in remaining modules (#12406) @vyasr
- Update `ops-bot.yaml` (#12402) @ajschmidt8
- Rework cudf::strings::integers_to_ipv4 to use make_strings_children utility (#12401) @davidwendt
- Use `numpy.empty()` instead of `bytearray` to allocate host memory for spilling (#12399) @madsbk
- Deprecate chunksize from dask_cudf.read_csv (#12394) @rjzamora
- Expose the RMM pool size in JNI (#12390) @revans2
- Fix COPYING_TEST: gtests coded in namespace cudf::test (#12387) @davidwendt
- Rework cudf::strings::url_encode to use make_strings_children utility (#12385) @davidwendt
- Use make_strings_children in parse_data nested json reader (#12382) @karthikeyann
- Fix warnings in test_datetime.py (#12381) @vyasr
- Mixed Join Benchmarks (#12375) @divyegala
- Fix warnings in dataframe.py (#12369) @vyasr
- Update conda recipes. (#12368) @bdice
- Use gpu-latest-1 runner tag (#12366) @bdice
- Rework cudf::strings::from_booleans to use make_strings_children (#12365) @vuule
- Fix warnings in test modules up to test_dataframe.py (#12355) @vyasr
- JSON column performance optimization - struct column nulls (#12354) @karthikeyann
- Accelerate stable-segmented-sort with CUB segmented sort (#12347) @davidwendt
- Add size check to make_offsets_child_column utility (#12345) @davidwendt
- Enable max compression ratio small block optimization for ZSTD (#12338) @vuule
- Fix warnings in test_monotonic.py (#12334) @vyasr
- Improve JSON column creation performance (list offsets) (#12330) @karthikeyann
- Upgrade to `arrow-10.0.1` (#12327) @galipremsagar
- Fix warnings in test_orc.py (#12326) @vyasr
- Fix warnings in test_groupby.py (#12324) @vyasr
- Fix `test_notebooks.sh` (#12323) @ajschmidt8
- Fix transform gtests coded in namespace cudf::test (#12321) @davidwendt
- Fix `check_style.sh` script (#12320) @ajschmidt8
- Rework cudf::strings::from_timestamps to use make_strings_children (#12317) @davidwendt
- Fix warnings in test_index.py (#12313) @vyasr
- Fix warnings in test_multiindex.py (#12310) @vyasr
- CSV, JSON reader to infer integer column with nulls as int64 instead of float64 (#12309) @karthikeyann
- Fix warnings in test_indexing.py (#12305) @vyasr
- Fix warnings in test_joining.py (#12304) @vyasr
- Unpin `dask` and `distributed` for development (#12302) @galipremsagar
- Re-enable `sccache` for Jenkins builds (#12297) @ajschmidt8
- Define needs for pr-builder workflow. (#12296) @bdice
- Forward merge 22.12 into 23.02 (#12294) @vyasr
- Fix warnings in test_stats.py (#12293) @vyasr
- Fix table gtests coded in namespace cudf::test (#12292) @davidwendt
- Change cython for regex calls to use cudf::strings::regex_program (#12289) @davidwendt
- Improved error reporting when reading multiple JSON files (#12285) @vuule
- Deprecate Frame.sum_of_squares (#12284) @vyasr
- Remove deprecated code for 23.02 (#12281) @vyasr
- Clean up handling of max_page_size_bytes in Parquet writer (#12277) @etseidl
- Fix replace gtests coded in namespace cudf::test (#12270) @davidwendt
- Add pandas nullable type support in `Index.to_pandas` (#12268) @galipremsagar
- Rework nvtext::detokenize to use indexalator for row indices (#12267) @davidwendt
- Fix reduction gtests coded in namespace cudf::test (#12257) @davidwendt
- Remove default parameters from cudf::detail::sort function declarations (#12254) @davidwendt
- Add `duplicated` support for `Series`, `DataFrame` and `Index` (#12246) @galipremsagar
- Replace column/table test utilities with macros (#12242) @PointKernel
- Rework cudf::strings::pad and zfill to use make_strings_children (#12238) @davidwendt
- Fix sort gtests coded in namespace cudf::test (#12237) @davidwendt
- Wrapping concat and file writes in `@acquire_spill_lock()` (#12232) @madsbk
- Rename `cudf::structs::detail::superimpose_parent_nulls` APIs (#12230) @ttnghia
- Cover parsing to decimal types in `read_json` tests (#12229) @vuule
- Spill Statistics (#12223) @madsbk
- Use CUDF_JNI_ENABLE_PROFILING to conditionally enable profiling support. (#12221) @bdice
- Clean up of `test_spilling.py` (#12220) @madsbk
- Simplify repetitive boolean logic (#12218) @vuule
- Add `Series.hasnans` and `Index.hasnans` (#12214) @galipremsagar
- Add cudf::strings::udf::replace function (#12210) @davidwendt
- Adds in new java APIs for appending byte arrays to host columnar data (#12208) @revans2
- Remove Python dependencies from Java CI. (#12193) @bdice
- Fix null order in sort-based groupby and improve groupby tests (#12191) @divyegala
- Move strings children functions from cudf/strings/detail/utilities.cuh to new header (#12185) @davidwendt
- Clean up existing JNI scalar to column code (#12173) @revans2
- Remove JIT type names, refactor id_to_type. (#12158) @bdice
- Update JNI version to 23.02.0-SNAPSHOT (#12129) @pxLi
- Minor refactor of cpp/src/io/parquet/page_data.cu (#12126) @etseidl
- Add codespell as a linter (#12097) @benfred
- Enable specifying exceptions in error macros (#12078) @vyasr
- Move `_label_encoding` from Series to Column (#12040) @shwina
- Add GitHub Actions Workflows (#12002) @ajschmidt8
- Consolidate dask-cudf `groupby_agg` calls in one place (#10835) @charlesbluca
Published by raydouglass about 3 years ago
https://github.com/rapidsai/cudf - v22.12.01
π¨ Breaking Changes
- Add JNI for `substring` without 'end' parameter. (#12113) @firestarman
- Refactor `purge_nonempty_nulls` (#12111) @ttnghia
- Create an `int8` column in `read_csv` when all elements are missing (#12110) @vuule
- Throw an error when libcudf is built without cuFile and `LIBCUDF_CUFILE_POLICY` is set to `"ALWAYS"` (#12080) @vuule
- Fix type promotion edge cases in numerical binops (#12074) @wence-
- Reduce/Remove reliance on `**kwargs` and `*args` in `IO` readers & writers (#12025) @galipremsagar
- Rollback of `DeviceBufferLike` (#12009) @madsbk
- Remove unused `managed_allocator` (#12005) @vyasr
- Pass column names to `write_csv` instead of `table_metadata` pointer (#11972) @vuule
- Accept const refs instead of const unique_ptr refs in reduce and scan APIs. (#11960) @vyasr
- Default to equal NaNs in make_merge_sets_aggregation. (#11952) @bdice
- Remove validation that requires introspection (#11938) @vyasr
- Trim quotes for non-string values in nested json parsing (#11898) @karthikeyann
- Add tests ensuring that cudf's default stream is always used (#11875) @vyasr
- Support nested types as groupby keys in libcudf (#11792) @PointKernel
- Default to equal NaNs in make_collect_set_aggregation. (#11621) @bdice
- Removing int8 column option from parquet byte_array writing (#11539) @hyperbolic2346
- part1: Simplify BaseIndex to an abstract class (#10389) @skirui-source
🐛 Bug Fixes
- strings_udf: use libcudf caching of character tables (#12343) @wence-
- Fix include line for IO Cython modules (#12250) @vyasr
- Make dask pinning looser (#12231) @vyasr
- Workaround for CUB segmented-sort bug with boolean keys (#12217) @davidwendt
- Fix `from_dict` backend dispatch to match upstream `dask` (#12203) @galipremsagar
- Merge branch-22.10 into branch-22.12 (#12198) @davidwendt
- Fix compression in ORC writer (#12194) @vuule
- Don't use CMake 3.25.0 as it has a show stopping FindCUDAToolkit bug (#12188) @robertmaynard
- Fix data corruption when reading ORC files with empty stripes (#12160) @vuule
- Fix decimal binary operations (#12142) @galipremsagar
- Ensure dlpack include is provided to cudf interop lib (#12139) @robertmaynard
- Safely allocate `udf_string` pointers in `strings_udf` (#12138) @brandon-b-miller
- Fix/disable jitify lto (#12122) @robertmaynard
- Fix conditional_full_join benchmark (#12121) @GregoryKimball
- Fix regex working-memory-size refactor error (#12119) @davidwendt
- Add in negative size checks for columns (#12118) @revans2
- Add JNI for `substring` without 'end' parameter. (#12113) @firestarman
- Fix reading of CSV files with blank second row (#12098) @vuule
- Fix an error in IO with `GzipFile` type (#12085) @galipremsagar
- Workaround groupby aggregate thrust::copy_if overflow (#12079) @davidwendt
- Fix alignment of compressed blocks in ORC writer (#12077) @vuule
- Fix singleton-range `__setitem__` edge case (#12075) @wence-
- Fix type promotion edge cases in numerical binops (#12074) @wence-
- Force using old fmt in nvbench. (#12067) @vyasr
- Fixes List offset bug in Nested JSON reader (#12060) @karthikeyann
- Allow falling back to `shim_60.ptx` by default in `strings_udf` (#12056) @brandon-b-miller
- Force black exclusions for pre-commit. (#12036) @bdice
- Add `memory_usage` & `items` implementation for `Struct` column & dtype (#12033) @galipremsagar
- Reduce/Remove reliance on `**kwargs` and `*args` in `IO` readers & writers (#12025) @galipremsagar
- Fixes bug in csv_reader_options construction in cython (#12021) @karthikeyann
- Fix issues when both `usecols` and `names` options are used in `read_csv` (#12018) @vuule
- Port thrust's pinned_allocator to cudf, since Thrust 1.17 removes the type (#12004) @robertmaynard
- Revert "Replace most of preprocessor usage in nvcomp adapter with `constexpr`" (#11999) @vuule
- Fix bug where `df.loc` resulting in single row could give wrong index (#11998) @eriknw
- Switch to DISABLE_DEPRECATION_WARNINGS to match other RAPIDS projects (#11989) @robertmaynard
- Fix maximum page size estimate in Parquet writer (#11962) @vuule
- Fix local offset handling in bgzip reader (#11918) @upsj
- Fix an issue reading struct-of-list types in Parquet. (#11910) @nvdbaranec
- Fix memcheck error in TypeInference.Timestamp gtest (#11905) @davidwendt
- Fix type casting in Series.__setitem__ (#11904) @wence-
- Fix memcheck error in get_dremel_data (#11903) @davidwendt
- Fixes Unsupported column type error due to empty list columns in Nested JSON reader (#11897) @karthikeyann
- Fix segmented-sort to ignore indices outside the offsets (#11888) @davidwendt
- Fix cudf::stable_sorted_order for NaN and -NaN in FLOAT64 columns (#11874) @davidwendt
- Fix writing of Parquet files with many fragments (#11869) @etseidl
- Fix RangeIndex unary operators. (#11868) @vyasr
- JNI Avoid NPE for reading host binary data (#11865) @revans2
- Fix decimal benchmark input data generation (#11863) @karthikeyann
- Fix pre-commit copyright check (#11860) @galipremsagar
- Fix Parquet support for seconds and milliseconds duration types (#11854) @vuule
- Ensure better compiler cache results between cudf cal-ver branches (#11835) @robertmaynard
- Fix make_column_from_scalar for all-null strings column (#11807) @davidwendt
- Tell jitify_preprocess where to search for libnvrtc (#11787) @robertmaynard
- add V2 page header support to parquet reader (#11778) @etseidl
- Parquet reader: bug fix for a num_rows/skip_rows corner case, w/optimization for nested preprocessing (#11752) @nvdbaranec
- Determine if Arrow has S3 support at runtime in unit test. (#11560) @bdice
📖 Documentation
- Use rapidsai CODE_OF_CONDUCT.md (#12166) @bdice
- Add symlinks to notebooks. (#12128) @bdice
- Add `truncate` API to python doc pages (#12109) @galipremsagar
- Update Numba docs links. (#12107) @bdice
- Remove "Multi-GPU with Dask-cuDF" notebook. (#12095) @bdice
- Fix link to c++ developer guide from `CONTRIBUTING.md` (#12084) @brandon-b-miller
- Add pivot_table and crosstab to docs. (#12014) @bdice
- Fix doxygen text for cudf::dictionary::encode (#11991) @davidwendt
- Replace default_stream_value with get_default_stream in docs. (#11985) @vyasr
- Add dtype docs pages and docstrings for `cudf` specific dtypes (#11974) @galipremsagar
- Update Unit Testing in libcudf guidelines to code tests outside the cudf::test namespace (#11959) @davidwendt
- Rename libcudf++ to libcudf. (#11953) @bdice
- Fix documentation referring to removed as_gpu_matrix method. (#11937) @bdice
- Remove "experimental" warning for struct columns in ORC reader and writer (#11880) @vuule
- Initial draft of policies and guidelines for libcudf usage. (#11853) @vyasr
- Add clear indication of non-GPU accelerated parameters in read_json docstring (#11825) @GregoryKimball
- Add developer docs for writing tests (#11199) @vyasr
🚀 New Features
- Adds an EventHandler to Java MemoryBuffer to be invoked on close (#12125) @abellina
- Support `+` in `strings_udf` (#12117) @brandon-b-miller
- Support `upper` and `lower` in `strings_udf` (#12099) @brandon-b-miller
- Add wheel builds (#12096) @vyasr
- Allow setting malloc heap size in string udfs (#12094) @brandon-b-miller
- Support `strip`, `lstrip`, and `rstrip` in `strings_udf` (#12091) @brandon-b-miller
- Mark nvcomp zstd compression stable (#12059) @jbrennan333
- Add debug-only onAllocated/onDeallocated to RmmEventHandler (#12054) @abellina
- Enable building against the libarrow contained in pyarrow (#12034) @vyasr
- Add strings `like` jni and native method (#12032) @cindyyuanjiang
- Cleanup common parsing code in JSON, CSV reader (#12022) @karthikeyann
- byte_range support for JSON Lines format (#12017) @karthikeyann
- Minor cleanup of root CMakeLists.txt for better organization (#11988) @robertmaynard
- Add inplace arithmetic operators to `MaskedType` (#11987) @brandon-b-miller
- Implement JNI for chunked Parquet reader (#11961) @ttnghia
- Add method argument to DataFrame.quantile (#11957) @rjzamora
- Add gpu memory watermark apis to JNI (#11950) @abellina
- Adds retryCount to RmmEventHandler.onAllocFailure (#11940) @abellina
- Enable returning string data from UDFs used through `apply` (#11933) @brandon-b-miller
- Switch over to rapids-cmake patches for thrust (#11921) @robertmaynard
- Add strings udf C++ classes and functions for phase II (#11912) @davidwendt
- Trim quotes for non-string values in nested json parsing (#11898) @karthikeyann
- Enable CEC for `strings_udf` (#11884) @brandon-b-miller
- ArrowIPCTableWriter writes an empty batch in the case of an empty table. (#11883) @firestarman
- Implement chunked Parquet reader (#11867) @ttnghia
- Add `read_orc_metadata` to libcudf (#11815) @vuule
- Support nested types as groupby keys in libcudf (#11792) @PointKernel
- Adding feature Truncate to DataFrame and Series (#11435) @VamsiTallam95
🛠️ Improvements
- Reduce number of tests marked `spilling` (#12197) @madsbk
- Pin `dask` and `distributed` for release (#12165) @galipremsagar
- Don't rely on GNU find in headers_test.sh (#12164) @wence-
- Update cp.clip call (#12148) @quasiben
- Enable automatic column projection in groupby().agg (#12124) @rjzamora
- Refactor `purge_nonempty_nulls` (#12111) @ttnghia
- Create an `int8` column in `read_csv` when all elements are missing (#12110) @vuule
- Spilling to host memory (#12106) @madsbk
- First pass of `pd.read_orc` changes in tests (#12103) @galipremsagar
- Expose engine argument in dask_cudf.read_json (#12101) @rjzamora
- Remove CUDA 10 compatibility code. (#12088) @bdice
- Move and update `dask` nightly install in CI (#12082) @galipremsagar
- Throw an error when libcudf is built without cuFile and `LIBCUDF_CUFILE_POLICY` is set to `"ALWAYS"` (#12080) @vuule
- Remove macros that inspect the contents of exceptions (#12076) @vyasr
- Fix ingest_raw_data performance issue in Nested JSON reader due to RVO (#12070) @karthikeyann
- Remove overflow error during decimal binops (#12063) @galipremsagar
- Change cudf::detail::tdigest to cudf::tdigest::detail (#12050) @davidwendt
- Fix quantile gtests coded in namespace cudf::test (#12049) @davidwendt
- Add support for `DataFrame.from_dict`/`to_dict` and `Series.to_dict` (#12048) @galipremsagar
- Refactor Parquet reader (#12046) @ttnghia
- Forward merge 22.10 into 22.12 (#12045) @vyasr
- Standardize newlines at ends of files. (#12042) @bdice
- Trim trailing whitespace from all files. (#12041) @bdice
- Use nosync policy in gather and scatter implementations. (#12038) @bdice
- Remove smart quotes from all docstrings. (#12035) @bdice
- Update cuda-python dependency to 11.7.1 (#12030) @galipremsagar
- Add cython-lint to pre-commit checks. (#12020) @bdice
- Use pragma once (#12019) @bdice
- New GHA to add issues/prs to project board (#12016) @jarmak-nv
- Add DataFrame.pivot_table. (#12015) @bdice
- Rollback of `DeviceBufferLike` (#12009) @madsbk
- Remove default parameters for nvtext::detail functions (#12007) @davidwendt
- Remove default parameters for cudf::dictionary::detail functions (#12006) @davidwendt
- Remove unused `managed_allocator` (#12005) @vyasr
- Remove default parameters for cudf::strings::detail functions (#12003) @davidwendt
- Remove unnecessary code from dask-cudf _Frame (#12001) @rjzamora
- Ignore python docs build artifacts (#12000) @galipremsagar
- Use rapids-cmake for google benchmark. (#11997) @vyasr
- Leverage rapids_cython for more automated RPATH handling (#11996) @vyasr
- Remove stale labeler (#11995) @raydouglass
- Move protobuf compilation to CMake (#11986) @vyasr
- Replace most of preprocessor usage in nvcomp adapter with `constexpr` (#11980) @vuule
- Add missing noexcepts to column_in_metadata methods (#11973) @vyasr
- Pass column names to `write_csv` instead of `table_metadata` pointer (#11972) @vuule
- Accelerate libcudf segmented sort with CUB segmented sort (#11969) @davidwendt
- Feature/remove default streams (#11967) @vyasr
- Add pool memory resource to libcudf basic example (#11966) @davidwendt
- Fix some libcudf calls to cudf::detail::gather (#11963) @davidwendt
- Accept const refs instead of const unique_ptr refs in reduce and scan APIs. (#11960) @vyasr
- Add deprecation warning for set_allocator. (#11958) @vyasr
- Fix lists and structs gtests coded in namespace cudf::test (#11956) @davidwendt
- Add full page indexes to Parquet writer benchmarks (#11955) @etseidl
- Use gather-based strings factory in cudf::strings::strip (#11954) @davidwendt
- Default to equal NaNs in make_merge_sets_aggregation. (#11952) @bdice
- Add `strip_delimiters` option to `read_text` (#11946) @upsj
- Refactor multibyte_split `output_builder` (#11945) @upsj
- Remove validation that requires introspection (#11938) @vyasr
- Add `.str.find_multiple` API (#11928) @galipremsagar
- Add regex_program class for use with all regex APIs (#11927) @davidwendt
- Enable backend dispatching for Dask-DataFrame creation (#11920) @rjzamora
- Performance improvement in JSON Tree traversal (#11919) @karthikeyann
- Fix some gtests incorrectly coded in namespace cudf::test (part I) (#11917) @davidwendt
- Refactor pad/zfill functions for reuse with strings udf (#11914) @davidwendt
- Add `nanosecond` & `microsecond` to `DatetimeProperties` (#11911) @galipremsagar
- Pin mimesis version in setup.py. (#11906) @bdice
- Error on `ListColumn` or any new unsupported column in `cudf.Index` (#11902) @galipremsagar
- Add thrust output iterator fix (1805) to thrust.patch (#11900) @davidwendt
- Relax `codecov` threshold diff (#11899) @galipremsagar
- Use public APIs in STREAM_COMPACTION_NVBENCH (#11892) @GregoryKimball
- Add coverage for string UDF tests. (#11891) @vyasr
- Provide `data_chunk_source` wrapper for `datasource` (#11886) @upsj
- Handle `multibyte_split` byte_range out-of-bounds offsets on host (#11885) @upsj
- Add tests ensuring that cudf's default stream is always used (#11875) @vyasr
- Change expect_strings_empty into expect_column_empty libcudf test utility (#11873) @davidwendt
- Add ngroup (#11871) @shwina
- Reduce memory usage in nested JSON parser - tree generation (#11864) @karthikeyann
- Unpin `dask` and `distributed` for development (#11859) @galipremsagar
- Remove unused includes for table/row_operators (#11857) @GregoryKimball
- Use conda-forge's `pyorc` (#11855) @jakirkham
- Add libcudf strings examples (#11849) @davidwendt
- Remove `cudf_io` namespace alias (#11827) @vuule
- Test/remove thrust vector usage (#11813) @vyasr
- Add BGZIP reader to python `read_text` (#11802) @upsj
- Merge branch-22.10 into branch-22.12 (#11801) @davidwendt
- Fix compile warning from CUDF_FUNC_RANGE in a member function (#11798) @davidwendt
- Update cudf JNI version to 22.12.0-SNAPSHOT (#11764) @pxLi
- Update flake8 to 5.0.4 and use flake8-force to check Cython. (#11736) @bdice
- Add BGZIP multibyte_split benchmark (#11723) @upsj
- Bifurcate Dependency Lists (#11674) @bdice
- Default to equal NaNs in make_collect_set_aggregation. (#11621) @bdice
- Conform "bench_isin" to match generator column names (#11549) @GregoryKimball
- Removing int8 column option from parquet byte_array writing (#11539) @hyperbolic2346
- Add checks for HLG layers in dask-cudf groupby tests (#10853) @charlesbluca
- part1: Simplify BaseIndex to an abstract class (#10389) @skirui-source
- Make all `nvcc` warnings into errors (#8916) @trxcllnt
- C++
Published by GPUtester about 3 years ago
https://github.com/rapidsai/cudf - v22.12.00
🚨 Breaking Changes
- Add JNI for `substring` without 'end' parameter. (#12113) @firestarman
- Refactor `purge_nonempty_nulls` (#12111) @ttnghia
- Create an `int8` column in `read_csv` when all elements are missing (#12110) @vuule
- Throw an error when libcudf is built without cuFile and `LIBCUDF_CUFILE_POLICY` is set to `"ALWAYS"` (#12080) @vuule
- Fix type promotion edge cases in numerical binops (#12074) @wence-
- Reduce/Remove reliance on `**kwargs` and `*args` in `IO` readers & writers (#12025) @galipremsagar
- Rollback of `DeviceBufferLike` (#12009) @madsbk
- Remove unused `managed_allocator` (#12005) @vyasr
- Pass column names to `write_csv` instead of `table_metadata` pointer (#11972) @vuule
- Accept const refs instead of const unique_ptr refs in reduce and scan APIs. (#11960) @vyasr
- Default to equal NaNs in make_merge_sets_aggregation. (#11952) @bdice
- Remove validation that requires introspection (#11938) @vyasr
- Trim quotes for non-string values in nested json parsing (#11898) @karthikeyann
- Add tests ensuring that cudf's default stream is always used (#11875) @vyasr
- Support nested types as groupby keys in libcudf (#11792) @PointKernel
- Default to equal NaNs in make_collect_set_aggregation. (#11621) @bdice
- Removing int8 column option from parquet byte_array writing (#11539) @hyperbolic2346
- part1: Simplify BaseIndex to an abstract class (#10389) @skirui-source
🐛 Bug Fixes
- Fix include line for IO Cython modules (#12250) @vyasr
- Make dask pinning looser (#12231) @vyasr
- Workaround for CUB segmented-sort bug with boolean keys (#12217) @davidwendt
- Fix `from_dict` backend dispatch to match upstream `dask` (#12203) @galipremsagar
- Merge branch-22.10 into branch-22.12 (#12198) @davidwendt
- Fix compression in ORC writer (#12194) @vuule
- Don't use CMake 3.25.0 as it has a show stopping FindCUDAToolkit bug (#12188) @robertmaynard
- Fix data corruption when reading ORC files with empty stripes (#12160) @vuule
- Fix decimal binary operations (#12142) @galipremsagar
- Ensure dlpack include is provided to cudf interop lib (#12139) @robertmaynard
- Safely allocate `udf_string` pointers in `strings_udf` (#12138) @brandon-b-miller
- Fix/disable jitify lto (#12122) @robertmaynard
- Fix conditional_full_join benchmark (#12121) @GregoryKimball
- Fix regex working-memory-size refactor error (#12119) @davidwendt
- Add in negative size checks for columns (#12118) @revans2
- Add JNI for `substring` without 'end' parameter. (#12113) @firestarman
- Fix reading of CSV files with blank second row (#12098) @vuule
- Fix an error in IO with `GzipFile` type (#12085) @galipremsagar
- Workaround groupby aggregate thrust::copy_if overflow (#12079) @davidwendt
- Fix alignment of compressed blocks in ORC writer (#12077) @vuule
- Fix singleton-range `__setitem__` edge case (#12075) @wence-
- Fix type promotion edge cases in numerical binops (#12074) @wence-
- Force using old fmt in nvbench. (#12067) @vyasr
- Fixes List offset bug in Nested JSON reader (#12060) @karthikeyann
- Allow falling back to `shim_60.ptx` by default in `strings_udf` (#12056) @brandon-b-miller
- Force black exclusions for pre-commit. (#12036) @bdice
- Add `memory_usage` & `items` implementation for `Struct` column & dtype (#12033) @galipremsagar
- Reduce/Remove reliance on `**kwargs` and `*args` in `IO` readers & writers (#12025) @galipremsagar
- Fixes bug in csv_reader_options construction in cython (#12021) @karthikeyann
- Fix issues when both `usecols` and `names` options are used in `read_csv` (#12018) @vuule
- Port thrust's pinned_allocator to cudf, since Thrust 1.17 removes the type (#12004) @robertmaynard
- Revert "Replace most of preprocessor usage in nvcomp adapter with `constexpr`" (#11999) @vuule
- Fix bug where `df.loc` resulting in single row could give wrong index (#11998) @eriknw
- Switch to DISABLE_DEPRECATION_WARNINGS to match other RAPIDS projects (#11989) @robertmaynard
- Fix maximum page size estimate in Parquet writer (#11962) @vuule
- Fix local offset handling in bgzip reader (#11918) @upsj
- Fix an issue reading struct-of-list types in Parquet. (#11910) @nvdbaranec
- Fix memcheck error in TypeInference.Timestamp gtest (#11905) @davidwendt
- Fix type casting in Series.__setitem__ (#11904) @wence-
- Fix memcheck error in get_dremel_data (#11903) @davidwendt
- Fixes Unsupported column type error due to empty list columns in Nested JSON reader (#11897) @karthikeyann
- Fix segmented-sort to ignore indices outside the offsets (#11888) @davidwendt
- Fix cudf::stable_sorted_order for NaN and -NaN in FLOAT64 columns (#11874) @davidwendt
- Fix writing of Parquet files with many fragments (#11869) @etseidl
- Fix RangeIndex unary operators. (#11868) @vyasr
- JNI Avoid NPE for reading host binary data (#11865) @revans2
- Fix decimal benchmark input data generation (#11863) @karthikeyann
- Fix pre-commit copyright check (#11860) @galipremsagar
- Fix Parquet support for seconds and milliseconds duration types (#11854) @vuule
- Ensure better compiler cache results between cudf cal-ver branches (#11835) @robertmaynard
- Fix make_column_from_scalar for all-null strings column (#11807) @davidwendt
- Tell jitify_preprocess where to search for libnvrtc (#11787) @robertmaynard
- add V2 page header support to parquet reader (#11778) @etseidl
- Parquet reader: bug fix for a num_rows/skip_rows corner case, w/optimization for nested preprocessing (#11752) @nvdbaranec
- Determine if Arrow has S3 support at runtime in unit test. (#11560) @bdice
📖 Documentation
- Use rapidsai CODE_OF_CONDUCT.md (#12166) @bdice
- Add symlinks to notebooks. (#12128) @bdice
- Add `truncate` API to python doc pages (#12109) @galipremsagar
- Update Numba docs links. (#12107) @bdice
- Remove "Multi-GPU with Dask-cuDF" notebook. (#12095) @bdice
- Fix link to c++ developer guide from `CONTRIBUTING.md` (#12084) @brandon-b-miller
- Add pivot_table and crosstab to docs. (#12014) @bdice
- Fix doxygen text for cudf::dictionary::encode (#11991) @davidwendt
- Replace default_stream_value with get_default_stream in docs. (#11985) @vyasr
- Add dtype docs pages and docstrings for `cudf` specific dtypes (#11974) @galipremsagar
- Update Unit Testing in libcudf guidelines to code tests outside the cudf::test namespace (#11959) @davidwendt
- Rename libcudf++ to libcudf. (#11953) @bdice
- Fix documentation referring to removed as_gpu_matrix method. (#11937) @bdice
- Remove "experimental" warning for struct columns in ORC reader and writer (#11880) @vuule
- Initial draft of policies and guidelines for libcudf usage. (#11853) @vyasr
- Add clear indication of non-GPU accelerated parameters in read_json docstring (#11825) @GregoryKimball
- Add developer docs for writing tests (#11199) @vyasr
🚀 New Features
- Adds an EventHandler to Java MemoryBuffer to be invoked on close (#12125) @abellina
- Support `+` in `strings_udf` (#12117) @brandon-b-miller
- Support `upper` and `lower` in `strings_udf` (#12099) @brandon-b-miller
- Add wheel builds (#12096) @vyasr
- Allow setting malloc heap size in string udfs (#12094) @brandon-b-miller
- Support `strip`, `lstrip`, and `rstrip` in `strings_udf` (#12091) @brandon-b-miller
- Mark nvcomp zstd compression stable (#12059) @jbrennan333
- Add debug-only onAllocated/onDeallocated to RmmEventHandler (#12054) @abellina
- Enable building against the libarrow contained in pyarrow (#12034) @vyasr
- Add strings `like` jni and native method (#12032) @cindyyuanjiang
- Cleanup common parsing code in JSON, CSV reader (#12022) @karthikeyann
- byte_range support for JSON Lines format (#12017) @karthikeyann
- Minor cleanup of root CMakeLists.txt for better organization (#11988) @robertmaynard
- Add inplace arithmetic operators to `MaskedType` (#11987) @brandon-b-miller
- Implement JNI for chunked Parquet reader (#11961) @ttnghia
- Add method argument to DataFrame.quantile (#11957) @rjzamora
- Add gpu memory watermark apis to JNI (#11950) @abellina
- Adds retryCount to RmmEventHandler.onAllocFailure (#11940) @abellina
- Enable returning string data from UDFs used through `apply` (#11933) @brandon-b-miller
- Switch over to rapids-cmake patches for thrust (#11921) @robertmaynard
- Add strings udf C++ classes and functions for phase II (#11912) @davidwendt
- Trim quotes for non-string values in nested json parsing (#11898) @karthikeyann
- Enable CEC for `strings_udf` (#11884) @brandon-b-miller
- ArrowIPCTableWriter writes an empty batch in the case of an empty table. (#11883) @firestarman
- Implement chunked Parquet reader (#11867) @ttnghia
- Add `read_orc_metadata` to libcudf (#11815) @vuule
- Support nested types as groupby keys in libcudf (#11792) @PointKernel
- Adding feature Truncate to DataFrame and Series (#11435) @VamsiTallam95
🛠️ Improvements
- Reduce number of tests marked `spilling` (#12197) @madsbk
- Pin `dask` and `distributed` for release (#12165) @galipremsagar
- Don't rely on GNU find in headers_test.sh (#12164) @wence-
- Update cp.clip call (#12148) @quasiben
- Enable automatic column projection in groupby().agg (#12124) @rjzamora
- Refactor `purge_nonempty_nulls` (#12111) @ttnghia
- Create an `int8` column in `read_csv` when all elements are missing (#12110) @vuule
- Spilling to host memory (#12106) @madsbk
- First pass of `pd.read_orc` changes in tests (#12103) @galipremsagar
- Expose engine argument in dask_cudf.read_json (#12101) @rjzamora
- Remove CUDA 10 compatibility code. (#12088) @bdice
- Move and update `dask` nightly install in CI (#12082) @galipremsagar
- Throw an error when libcudf is built without cuFile and `LIBCUDF_CUFILE_POLICY` is set to `"ALWAYS"` (#12080) @vuule
- Remove macros that inspect the contents of exceptions (#12076) @vyasr
- Fix ingest_raw_data performance issue in Nested JSON reader due to RVO (#12070) @karthikeyann
- Remove overflow error during decimal binops (#12063) @galipremsagar
- Change cudf::detail::tdigest to cudf::tdigest::detail (#12050) @davidwendt
- Fix quantile gtests coded in namespace cudf::test (#12049) @davidwendt
- Add support for `DataFrame.from_dict`/`to_dict` and `Series.to_dict` (#12048) @galipremsagar
- Refactor Parquet reader (#12046) @ttnghia
- Forward merge 22.10 into 22.12 (#12045) @vyasr
- Standardize newlines at ends of files. (#12042) @bdice
- Trim trailing whitespace from all files. (#12041) @bdice
- Use nosync policy in gather and scatter implementations. (#12038) @bdice
- Remove smart quotes from all docstrings. (#12035) @bdice
- Update cuda-python dependency to 11.7.1 (#12030) @galipremsagar
- Add cython-lint to pre-commit checks. (#12020) @bdice
- Use pragma once (#12019) @bdice
- New GHA to add issues/prs to project board (#12016) @jarmak-nv
- Add DataFrame.pivot_table. (#12015) @bdice
- Rollback of `DeviceBufferLike` (#12009) @madsbk
- Remove default parameters for nvtext::detail functions (#12007) @davidwendt
- Remove default parameters for cudf::dictionary::detail functions (#12006) @davidwendt
- Remove unused `managed_allocator` (#12005) @vyasr
- Remove default parameters for cudf::strings::detail functions (#12003) @davidwendt
- Remove unnecessary code from dask-cudf _Frame (#12001) @rjzamora
- Ignore python docs build artifacts (#12000) @galipremsagar
- Use rapids-cmake for google benchmark. (#11997) @vyasr
- Leverage rapids_cython for more automated RPATH handling (#11996) @vyasr
- Remove stale labeler (#11995) @raydouglass
- Move protobuf compilation to CMake (#11986) @vyasr
- Replace most of preprocessor usage in nvcomp adapter with `constexpr` (#11980) @vuule
- Add missing noexcepts to column_in_metadata methods (#11973) @vyasr
- Pass column names to `write_csv` instead of `table_metadata` pointer (#11972) @vuule
- Accelerate libcudf segmented sort with CUB segmented sort (#11969) @davidwendt
- Feature/remove default streams (#11967) @vyasr
- Add pool memory resource to libcudf basic example (#11966) @davidwendt
- Fix some libcudf calls to cudf::detail::gather (#11963) @davidwendt
- Accept const refs instead of const unique_ptr refs in reduce and scan APIs. (#11960) @vyasr
- Add deprecation warning for set_allocator. (#11958) @vyasr
- Fix lists and structs gtests coded in namespace cudf::test (#11956) @davidwendt
- Add full page indexes to Parquet writer benchmarks (#11955) @etseidl
- Use gather-based strings factory in cudf::strings::strip (#11954) @davidwendt
- Default to equal NaNs in make_merge_sets_aggregation. (#11952) @bdice
- Add `strip_delimiters` option to `read_text` (#11946) @upsj
- Refactor multibyte_split `output_builder` (#11945) @upsj
- Remove validation that requires introspection (#11938) @vyasr
- Add `.str.find_multiple` API (#11928) @galipremsagar
- Add regex_program class for use with all regex APIs (#11927) @davidwendt
- Enable backend dispatching for Dask-DataFrame creation (#11920) @rjzamora
- Performance improvement in JSON Tree traversal (#11919) @karthikeyann
- Fix some gtests incorrectly coded in namespace cudf::test (part I) (#11917) @davidwendt
- Refactor pad/zfill functions for reuse with strings udf (#11914) @davidwendt
- Add `nanosecond` & `microsecond` to `DatetimeProperties` (#11911) @galipremsagar
- Pin mimesis version in setup.py. (#11906) @bdice
- Error on `ListColumn` or any new unsupported column in `cudf.Index` (#11902) @galipremsagar
- Add thrust output iterator fix (1805) to thrust.patch (#11900) @davidwendt
- Relax `codecov` threshold diff (#11899) @galipremsagar
- Use public APIs in STREAM_COMPACTION_NVBENCH (#11892) @GregoryKimball
- Add coverage for string UDF tests. (#11891) @vyasr
- Provide `data_chunk_source` wrapper for `datasource` (#11886) @upsj
- Handle `multibyte_split` byte_range out-of-bounds offsets on host (#11885) @upsj
- Add tests ensuring that cudf's default stream is always used (#11875) @vyasr
- Change expect_strings_empty into expect_column_empty libcudf test utility (#11873) @davidwendt
- Add ngroup (#11871) @shwina
- Reduce memory usage in nested JSON parser - tree generation (#11864) @karthikeyann
- Unpin `dask` and `distributed` for development (#11859) @galipremsagar
- Remove unused includes for table/row_operators (#11857) @GregoryKimball
- Use conda-forge's `pyorc` (#11855) @jakirkham
- Add libcudf strings examples (#11849) @davidwendt
- Remove `cudf_io` namespace alias (#11827) @vuule
- Test/remove thrust vector usage (#11813) @vyasr
- Add BGZIP reader to python `read_text` (#11802) @upsj
- Merge branch-22.10 into branch-22.12 (#11801) @davidwendt
- Fix compile warning from CUDF_FUNC_RANGE in a member function (#11798) @davidwendt
- Update cudf JNI version to 22.12.0-SNAPSHOT (#11764) @pxLi
- Update flake8 to 5.0.4 and use flake8-force to check Cython. (#11736) @bdice
- Add BGZIP multibyte_split benchmark (#11723) @upsj
- Bifurcate Dependency Lists (#11674) @bdice
- Default to equal NaNs in make_collect_set_aggregation. (#11621) @bdice
- Conform "bench_isin" to match generator column names (#11549) @GregoryKimball
- Removing int8 column option from parquet byte_array writing (#11539) @hyperbolic2346
- Add checks for HLG layers in dask-cudf groupby tests (#10853) @charlesbluca
- part1: Simplify BaseIndex to an abstract class (#10389) @skirui-source
- Make all `nvcc` warnings into errors (#8916) @trxcllnt
- C++
Published by GPUtester about 3 years ago
https://github.com/rapidsai/cudf - [NIGHTLY] v22.10.00
🚨 Breaking Changes
- Disable Zstandard decompression on nvCOMP 2.4 and Pascal GPUs (#11856) @vuule
- Disable nvCOMP DEFLATE integration (#11811) @vuule
- Fix return type of `Index.isna` & `Index.notna` (#11769) @galipremsagar
- Remove `kwargs` in `read_csv` & `to_csv` (#11762) @galipremsagar
- Fix `cudf::partition*` APIs that do not return offsets for empty output table (#11709) @ttnghia
- Fix regex negated classes to not automatically include new-lines (#11644) @davidwendt
- Update zfill to match Python output (#11634) @davidwendt
- Upgrade `pandas` to `1.5` (#11617) @galipremsagar
- Change default value of `ordered` to `False` in `CategoricalDtype` (#11604) @galipremsagar
- Move cudf::strings::findall_record to cudf::strings::findall (#11575) @davidwendt
- Adding optional parquet reader schema (#11524) @hyperbolic2346
- Deprecate `skiprows` and `num_rows` in `read_orc` (#11522) @galipremsagar
- Remove support for skip_rows / num_rows options in the parquet reader. (#11503) @nvdbaranec
- Drop support for `skiprows` and `num_rows` in `cudf.read_parquet` (#11480) @galipremsagar
- Disable Arrow S3 support by default. (#11470) @bdice
- Convert thrust::optional usages to std::optional (#11455) @robertmaynard
- Remove unused is_struct trait. (#11450) @bdice
- Refactor the `Buffer` class (#11447) @madsbk
- Return empty dataframe when reading an ORC file using empty `columns` option (#11446) @vuule
- Refactor pad_side and strip_type enums into side_type enum (#11438) @davidwendt
- Remove HASH_SERIAL_MURMUR3 / serial32BitMurmurHash3 (#11383) @bdice
- Use the new JSON parser when the experimental reader is selected (#11364) @vuule
- Remove deprecated Series.applymap. (#11031) @bdice
- Remove deprecated expand parameter from str.findall. (#11030) @bdice
🐛 Bug Fixes
- Force using old fmt in nvbench. (#12064) @vyasr
- Update cuda-python dependency to 11.7.1 (#11994) @shwina
- Fixes bug in temporary decompression space estimation before calling nvcomp (#11879) @abellina
- Handle `ptx` file paths during `strings_udf` import (#11862) @galipremsagar
- Disable Zstandard decompression on nvCOMP 2.4 and Pascal GPUs (#11856) @vuule
- Reset `strings_udf` CEC and solve several related issues (#11846) @brandon-b-miller
- Fix bug in new shuffle-based groupby implementation (#11836) @rjzamora
- Fix `is_valid` checks in `Scalar._binaryop` (#11818) @wence-
- Fix operator `NotImplemented` issue with `numpy` (#11816) @galipremsagar
- Disable nvCOMP DEFLATE integration (#11811) @vuule
- Build `strings_udf` package with other python packages in nightlies (#11808) @brandon-b-miller
- Revert problematic shuffle=explicit-comms changes (#11803) @rjzamora
- Fix regex out-of-bounds write in strided rows logic (#11797) @davidwendt
- Build `cudf` locally before building `strings_udf` conda packages in CI (#11785) @brandon-b-miller
- Fix an issue in cudf::row_bit_count involving structs and lists at multiple levels. (#11779) @nvdbaranec
- Fix return type of `Index.isna` & `Index.notna` (#11769) @galipremsagar
- Fix issue with set-item in case of `list` and `struct` types (#11760) @galipremsagar
- Ensure all libcudf APIs run on cudf's default stream (#11759) @vyasr
- Resolve dask_cudf failures caused by upstream groupby changes (#11755) @rjzamora
- Fix ORC string sum statistics (#11740) @vuule
- Add `strings_udf` package for python 3.9 (#11730) @brandon-b-miller
- Ensure that all tests launch kernels on cudf's default stream (#11726) @vyasr
- Don't assume stream is a compile-time constant expression (#11725) @vyasr
- Fix get_thrust.cmake format at patch command (#11715) @davidwendt
- Fix `cudf::partition*` APIs that do not return offsets for empty output table (#11709) @ttnghia
- Fix cudf::lists::sort_lists for NaN and Infinity values (#11703) @davidwendt
- Modify ORC reader timestamp parsing to match the apache reader behavior (#11699) @vuule
- Fix `DataFrame.from_arrow` to preserve type metadata (#11698) @galipremsagar
- Fix compile error due to missing header (#11697) @ttnghia
- Default to Snappy compression in `to_orc` when using cuDF or Dask (#11690) @vuule
- Fix an issue related to `MultiIndex` when `group_keys=True` (#11689) @galipremsagar
- Transfer correct dtype to exploded column (#11687) @wence-
- Ignore protobuf generated files in
mypychecks (#11685) @galipremsagar - Maintain the index name after
.loc(#11677) @shwina - Fix issue with extracting nested column data & dtype preservation (#11671) @galipremsagar
- Ensure that all cudf tests and benchmarks are conda env aware (#11666) @robertmaynard
- Update to Thrust 1.17.2 to fix cub ODR issues (#11665) @robertmaynard
- Fix multi-file remote datasource bug (#11655) @rjzamora
- Fix invalid regex quantifier check to not include alternation (#11654) @davidwendt
- Fix bug in
device_write(): it uses an incorrect size (#11651) @madsbk - fixes overflows in benchmarks (#11649) @elstehle
- Fix regex negated classes to not automatically include new-lines (#11644) @davidwendt
- Fix compile error in benchmark nested_json.cpp (#11637) @davidwendt
- Update zfill to match Python output (#11634) @davidwendt
- Removed converted type for INT32 and INT64 since they do not convert (#11627) @hyperbolic2346
- Fix host scalars construction of nested types (#11612) @galipremsagar
- Fix compile warning in nestedjsongpu.cu (#11607) @davidwendt
- Change default value of
orderedtoFalseinCategoricalDtype(#11604) @galipremsagar - Preserve order if necessary when deduping categoricals internally (#11597) @brandon-b-miller
- Add is_timestamp test for leap second (60) (#11594) @davidwendt
- Fix an issue with
to_arrowwhen column name type is not a string (#11590) @galipremsagar - Fix exception in segmented-reduce benchmark (#11588) @davidwendt
- Fix encode/decode of negative timestamps in ORC reader/writer (#11586) @vuule
- Correct distribution data type in
quantilesbenchmark (#11584) @vuule - Fix multibyte_split benchmark for host buffers (#11583) @upsj
- xfail custreamz display test for now (#11567) @shwina
- Fix JNI for TableWithMeta to use schemainfo instead of columnnames (#11566) @jlowe
- Reduce code duplication for
dask&distributednightly/stable installs (#11565) @galipremsagar - Fix groupby failures in dask_cudf CI (#11561) @rjzamora
- Fix for pivot: error when 'values' is a multicharacter string (#11538) @shaswat-indian
- find_package(cudf) + arrow9 usable with cudf build directory (#11535) @robertmaynard
- Fixing crash when writing binary nested data in parquet (#11526) @hyperbolic2346
- Fix for: error when assigning a value to an empty series (#11523) @shaswat-indian
- Fix invalid results from conditional-left-anti-join in debug build (#11517) @davidwendt
- Fix cmake error after upgrading to Arrow 9 (#11513) @ttnghia
- Fix reverse binary operators acting on a host value and cudf.Scalar (#11512) @bdice
- Update parquet fuzz tests to drop support for
skiprows&num_rows(#11505) @galipremsagar - Use rapids-cmake 22.10 best practice for RAPIDS.cmake location (#11493) @robertmaynard
- Handle some zero-sized corner cases in dlpack interop (#11449) @wence-
- Return empty dataframe when reading an ORC file using empty
columnsoption (#11446) @vuule - libcudf c++ example updated to CPM version 0.35.3 (#11417) @robertmaynard
- Fix regex quantifier check to include capture groups (#11373) @davidwendt
- Fix readtext when byterange is aligned with field (#11371) @upsj
- Fix to_timestamps truncated subsecond calculation (#11367) @davidwendt
- column: calculate null_count before release()ing the cudf::column (#11365) @wence-
📖 Documentation
- Update guide-to-udfs notebook (#11861) @brandon-b-miller
- Update docstring for cudf.read_text (#11799) @GregoryKimball
- Add doc section for list & struct handling (#11770) @galipremsagar
- Document that minimum required CMake version is now 3.23.1 (#11751) @robertmaynard
- Update libcudf documentation build command in DOCUMENTATION.md (#11735) @davidwendt
- Add docs for use of string data to DataFrame.apply and Series.apply and update guide to UDFs notebook (#11733) @brandon-b-miller
- Enable more Pydocstyle rules (#11582) @bdice
- Remove unused cpp/img folder (#11554) @davidwendt
- Publish C++ developer docs (#11475) @vyasr
- Fix a misalignment in cudf.get_dummies docstring (#11443) @galipremsagar
- Update contributing doc to include links to the developer guides (#11390) @davidwendt
- Fix table_view_base doxygen format (#11340) @davidwendt
- Create main developer guide for Python (#11235) @vyasr
- Add developer documentation for benchmarking (#11122) @vyasr
- cuDF error handling document (#7917) @isVoid
🚀 New Features
- Add hasNull statistic reading ability to ORC (#11747) @devavret
- Add istitle to string UDFs (#11738) @brandon-b-miller
- JSON Column creation in GPU (#11714) @karthikeyann
- Adds option to take explicit nested schema for nested JSON reader (#11682) @elstehle
- Add BGZIP data_chunk_reader (#11652) @upsj
- Support DECIMAL order-by for RANGE window functions (#11645) @mythrocks
- changing version of cmake to 3.23.3 (#11619) @hyperbolic2346
- Generate unique keys table in java JNI contiguousSplitGroups (#11614) @res-life
- Generic type casting to support the new nested JSON reader (#11613) @elstehle
- JSON tree traversal (#11610) @karthikeyann
- Add casting operators to masked UDFs (#11578) @brandon-b-miller
- Adds type inference and type conversion for leaf-columns to the nested JSON parser (#11574) @elstehle
- Add strings 'like' function (#11558) @davidwendt
- Handle hyphen as literal for regex cclass when incomplete range (#11557) @davidwendt
- Enable ZSTD compression in ORC and Parquet writers (#11551) @vuule
- Adds support for json lines format to the nested JSON reader (#11534) @elstehle
- Adding optional parquet reader schema (#11524) @hyperbolic2346
- Adds GPU implementation of JSON-token-stream to JSON-tree (#11518) @karthikeyann
- Add gdb pretty-printers for simple types (#11499) @upsj
- Add create_random_column function to the data generator (#11490) @vuule
- Add fluent API builder to data_profile (#11479) @vuule
- Adds Nested Json benchmark (#11466) @karthikeyann
- Convert thrust::optional usages to std::optional (#11455) @robertmaynard
- Python API for the future experimental JSON reader (#11426) @vuule
- Return schema info from JSON reader (#11419) @vuule
- Add regex ASCII flag support for matching builtin character classes (#11404) @davidwendt
- Truncate parquet column indexes (#11403) @etseidl
- Adds the end-to-end JSON parser implementation (#11388) @elstehle
- Use the new JSON parser when the experimental reader is selected (#11364) @vuule
- Add placeholder for the experimental JSON reader (#11334) @vuule
- Add read-only functions on string dtypes to DataFrame.apply and Series.apply (#11319) @brandon-b-miller
- Added 'crosstab' and 'pivot_table' features (#11314) @shaswat-indian
- Quickly error out when trying to build with unsupported nvcc versions (#11297) @robertmaynard
- Adds JSON tokenizer (#11264) @elstehle
- List lexicographic comparator (#11129) @devavret
- Add generic type inference for cuIO (#11121) @PointKernel
- Fully support nested types in cudf::contains (#10656) @ttnghia
- Support nested types in lists::contains (#10548) @ttnghia
🛠️ Improvements
- Pin dask and distributed for release (#11822) @galipremsagar
- Add examples for Nested JSON reader (#11814) @GregoryKimball
- Support shuffle-based groupby aggregations in dask_cudf (#11800) @rjzamora
- Update strings udf version updater script (#11772) @galipremsagar
- Remove kwargs in read_csv & to_csv (#11762) @galipremsagar
- Pass dtype param to avoid pd.Series warnings (#11761) @galipremsagar
- Enable schema_element & keep_quotes support in json reader (#11746) @galipremsagar
- Add ability to construct ListColumn when size is None (#11745) @galipremsagar
- Reduces memory requirements in JSON parser and adds bytes/s and peak memory usage to benchmarks (#11732) @elstehle
- Add missing copyright headers. (#11712) @bdice
- Fix copyright check issues in pre-commit (#11711) @bdice
- Include decimal in supported types for range window order-by columns (#11710) @mythrocks
- Disable very large column gtest for contiguous-split (#11706) @davidwendt
- Drop split_out=None test from groupby.agg (#11704) @wence-
- Use CubinLinker for CUDA Minor Version Compatibility (#11701) @gmarkall
- Add regex capture-group parameter to auto convert to non-capture groups (#11695) @davidwendt
- Add a __dataframe__ method to the protocol dataframe object (#11692) @rgommers
- Special-case multibyte_split for single-byte delimiter (#11681) @upsj
- Remove isort exclusions (#11680) @bdice
- Refactor CSV reader benchmarks with nvbench (#11678) @PointKernel
- Check conda recipe headers with pre-commit (#11669) @bdice
- Remove redundant style check for clang-format. (#11668) @bdice
- Add support for group_keys in groupby (#11659) @galipremsagar
- Fix pandoc pinning. (#11658) @bdice
- Revert removal of skip_rows / num_rows options from the Parquet reader. (#11657) @nvdbaranec
- Update git metadata (#11647) @bdice
- Call set_null_count on a returning column if null-count is known (#11646) @davidwendt
- Fix some libcudf detail calls not passing the stream variable (#11642) @davidwendt
- Update to mypy 0.971 (#11640) @wence-
- Refactor strings strip functor to details header (#11635) @davidwendt
- Fix incorrect nullCount in get_json_object (#11633) @trxcllnt
- Simplify hostdevice_vector (#11631) @upsj
- Refactor parquet writer benchmarks with nvbench (#11623) @PointKernel
- Rework contains_scalar to check nulls at runtime (#11622) @davidwendt
- Fix incorrect memory resource used in rolling temp columns (#11618) @mythrocks
- Upgrade pandas to 1.5 (#11617) @galipremsagar
- Move type-dispatcher calls from traits.hpp to traits.cpp (#11616) @davidwendt
- Refactor parquet reader benchmarks with nvbench (#11611) @PointKernel
- Forward-merge branch-22.08 to branch-22.10 (#11608) @bdice
- Use stream in Java API. (#11601) @bdice
- Refactors of public/detail APIs, CUDF_FUNC_RANGE, stream handling. (#11600) @bdice
- Improve ORC writer benchmark with nvbench (#11598) @PointKernel
- Tune multibyte_split kernel (#11587) @upsj
- Move split_utils.cuh to strings/detail (#11585) @davidwendt
- Fix warnings due to compiler regression with if constexpr (#11581) @ttnghia
- Add full 24-bit dictionary support to Parquet writer (#11580) @etseidl
- Expose "explicit-comms" option in shuffle-based dask_cudf functions (#11576) @rjzamora
- Move cudf::strings::findall_record to cudf::strings::findall (#11575) @davidwendt
- Refactor dask_cudf groupby to use apply_concat_apply (#11571) @rjzamora
- Add ability to write list(struct) columns as map type in orc writer (#11568) @galipremsagar
- Add byte_range to multibyte_split benchmark + NVBench refactor (#11562) @upsj
- JNI support for writing binary columns in parquet (#11556) @revans2
- Support additional dictionary bit widths in Parquet writer (#11547) @etseidl
- Refactor string/numeric conversion utilities (#11545) @davidwendt
- Removing unnecessary asserts in parquet tests (#11544) @hyperbolic2346
- Clean up ORC reader benchmarks with NVBench (#11543) @PointKernel
- Reuse MurmurHash3_32 in Parquet page data. (#11528) @bdice
- Add hexadecimal value separators (#11527) @bdice
- Deprecate skiprows and num_rows in read_orc (#11522) @galipremsagar
- Struct support for NULL_EQUALS binary operation (#11520) @rwlee
- Bump hadoop-common from 3.2.3 to 3.2.4 in /java (#11516) @dependabot[bot]
- Fix Feather test warning. (#11511) @bdice
- copy_range ballot_syncs to have no execution dependency (#11508) @robertmaynard
- Upgrade to arrow-9.x (#11507) @galipremsagar
- Remove support for skip_rows / num_rows options in the parquet reader. (#11503) @nvdbaranec
- Single-pass multibyte_split (#11500) @upsj
- Sanitize percentile_approx() output for empty input (#11498) @SrikarVanavasam
- Unpin dask and distributed for development (#11492) @galipremsagar
- Move SparkMurmurHash3_32 functor. (#11489) @bdice
- Refactor group_nunique.cu to use nullate::DYNAMIC for reduce-by-key functor (#11482) @davidwendt
- Drop support for skiprows and num_rows in cudf.read_parquet (#11480) @galipremsagar
- Add reduction distinct_count benchmark (#11473) @ttnghia
- Add groupby nunique aggregation benchmark (#11472) @ttnghia
- Disable Arrow S3 support by default. (#11470) @bdice
- Add groupby max aggregation benchmark (#11464) @ttnghia
- Extract Dremel encoding code from Parquet (#11461) @vyasr
- Add missing Thrust #includes. (#11457) @bdice
- Make CMake hooks verbose (#11456) @vyasr
- Control Parquet page size through Python API (#11454) @etseidl
- Add control of Parquet column index creation to python (#11453) @etseidl
- Remove unused is_struct trait. (#11450) @bdice
- Refactor the Buffer class (#11447) @madsbk
- Refactor pad_side and strip_type enums into side_type enum (#11438) @davidwendt
- Update to Thrust 1.17.0 (#11437) @bdice
- Add in JNI for parsing JSON data and getting the metadata back too. (#11431) @revans2
- Convert byte_array_view to use std::byte (#11424) @hyperbolic2346
- Deprecate unflatten_nested_columns (#11421) @SrikarVanavasam
- Remove HASH_SERIAL_MURMUR3 / serial32BitMurmurHash3 (#11383) @bdice
- Add Spark list hashing Java tests (#11379) @bdice
- Move cmake to the build section. (#11376) @vyasr
- Remove use of CUDA driver API calls from libcudf (#11370) @shwina
- Add column constructor from device_uvector&& (#11356) @SrikarVanavasam
- Remove unused custreamz thirdparty directory (#11343) @vyasr
- Update jni version to 22.10.0-SNAPSHOT (#11338) @pxLi
- Enable using upstream jitify2 (#11287) @shwina
- Cache cudf.Scalar (#11246) @shwina
- Remove deprecated Series.applymap. (#11031) @bdice
- Remove deprecated expand parameter from str.findall. (#11030) @bdice
Published by rapids-bot[bot] over 3 years ago
https://github.com/rapidsai/cudf - v22.10.01
🚨 Breaking Changes
- Disable Zstandard decompression on nvCOMP 2.4 and Pascal GPUs (#11856) @vuule
- Disable nvCOMP DEFLATE integration (#11811) @vuule
- Fix return type of Index.isna & Index.notna (#11769) @galipremsagar
- Remove kwargs in read_csv & to_csv (#11762) @galipremsagar
- Fix cudf::partition* APIs that do not return offsets for empty output table (#11709) @ttnghia
- Fix regex negated classes to not automatically include new-lines (#11644) @davidwendt
- Update zfill to match Python output (#11634) @davidwendt
- Upgrade pandas to 1.5 (#11617) @galipremsagar
- Change default value of ordered to False in CategoricalDtype (#11604) @galipremsagar
- Move cudf::strings::findall_record to cudf::strings::findall (#11575) @davidwendt
- Adding optional parquet reader schema (#11524) @hyperbolic2346
- Deprecate skiprows and num_rows in read_orc (#11522) @galipremsagar
- Remove support for skip_rows / num_rows options in the parquet reader. (#11503) @nvdbaranec
- Drop support for skiprows and num_rows in cudf.read_parquet (#11480) @galipremsagar
- Disable Arrow S3 support by default. (#11470) @bdice
- Convert thrust::optional usages to std::optional (#11455) @robertmaynard
- Remove unused is_struct trait. (#11450) @bdice
- Refactor the Buffer class (#11447) @madsbk
- Return empty dataframe when reading an ORC file using empty columns option (#11446) @vuule
- Refactor pad_side and strip_type enums into side_type enum (#11438) @davidwendt
- Remove HASH_SERIAL_MURMUR3 / serial32BitMurmurHash3 (#11383) @bdice
- Use the new JSON parser when the experimental reader is selected (#11364) @vuule
- Remove deprecated Series.applymap. (#11031) @bdice
- Remove deprecated expand parameter from str.findall. (#11030) @bdice
🐛 Bug Fixes
- Update cuda-python dependency to 11.7.1 (#11994) @shwina
- Fixes bug in temporary decompression space estimation before calling nvcomp (#11879) @abellina
- Handle ptx file paths during strings_udf import (#11862) @galipremsagar
- Disable Zstandard decompression on nvCOMP 2.4 and Pascal GPUs (#11856) @vuule
- Reset strings_udf CEC and solve several related issues (#11846) @brandon-b-miller
- Fix bug in new shuffle-based groupby implementation (#11836) @rjzamora
- Fix is_valid checks in Scalar._binaryop (#11818) @wence-
- Fix operator NotImplemented issue with numpy (#11816) @galipremsagar
- Disable nvCOMP DEFLATE integration (#11811) @vuule
- Build strings_udf package with other python packages in nightlies (#11808) @brandon-b-miller
- Revert problematic shuffle=explicit-comms changes (#11803) @rjzamora
- Fix regex out-of-bounds write in strided rows logic (#11797) @davidwendt
- Build cudf locally before building strings_udf conda packages in CI (#11785) @brandon-b-miller
- Fix an issue in cudf::row_bit_count involving structs and lists at multiple levels. (#11779) @nvdbaranec
- Fix return type of Index.isna & Index.notna (#11769) @galipremsagar
- Fix issue with set-item in case of list and struct types (#11760) @galipremsagar
- Ensure all libcudf APIs run on cudf's default stream (#11759) @vyasr
- Resolve dask_cudf failures caused by upstream groupby changes (#11755) @rjzamora
- Fix ORC string sum statistics (#11740) @vuule
- Add strings_udf package for python 3.9 (#11730) @brandon-b-miller
- Ensure that all tests launch kernels on cudf's default stream (#11726) @vyasr
- Don't assume stream is a compile-time constant expression (#11725) @vyasr
- Fix get_thrust.cmake format at patch command (#11715) @davidwendt
- Fix cudf::partition* APIs that do not return offsets for empty output table (#11709) @ttnghia
- Fix cudf::lists::sort_lists for NaN and Infinity values (#11703) @davidwendt
- Modify ORC reader timestamp parsing to match the apache reader behavior (#11699) @vuule
- Fix DataFrame.from_arrow to preserve type metadata (#11698) @galipremsagar
- Fix compile error due to missing header (#11697) @ttnghia
- Default to Snappy compression in to_orc when using cuDF or Dask (#11690) @vuule
- Fix an issue related to MultiIndex when group_keys=True (#11689) @galipremsagar
- Transfer correct dtype to exploded column (#11687) @wence-
- Ignore protobuf generated files in mypy checks (#11685) @galipremsagar
- Maintain the index name after .loc (#11677) @shwina
- Fix issue with extracting nested column data & dtype preservation (#11671) @galipremsagar
- Ensure that all cudf tests and benchmarks are conda env aware (#11666) @robertmaynard
- Update to Thrust 1.17.2 to fix cub ODR issues (#11665) @robertmaynard
- Fix multi-file remote datasource bug (#11655) @rjzamora
- Fix invalid regex quantifier check to not include alternation (#11654) @davidwendt
- Fix bug in device_write(): it uses an incorrect size (#11651) @madsbk
- fixes overflows in benchmarks (#11649) @elstehle
- Fix regex negated classes to not automatically include new-lines (#11644) @davidwendt
- Fix compile error in benchmark nested_json.cpp (#11637) @davidwendt
- Update zfill to match Python output (#11634) @davidwendt
- Removed converted type for INT32 and INT64 since they do not convert (#11627) @hyperbolic2346
- Fix host scalars construction of nested types (#11612) @galipremsagar
- Fix compile warning in nested_json_gpu.cu (#11607) @davidwendt
- Change default value of ordered to False in CategoricalDtype (#11604) @galipremsagar
- Preserve order if necessary when deduping categoricals internally (#11597) @brandon-b-miller
- Add is_timestamp test for leap second (60) (#11594) @davidwendt
- Fix an issue with to_arrow when column name type is not a string (#11590) @galipremsagar
- Fix exception in segmented-reduce benchmark (#11588) @davidwendt
- Fix encode/decode of negative timestamps in ORC reader/writer (#11586) @vuule
- Correct distribution data type in quantiles benchmark (#11584) @vuule
- Fix multibyte_split benchmark for host buffers (#11583) @upsj
- xfail custreamz display test for now (#11567) @shwina
- Fix JNI for TableWithMeta to use schema_info instead of column_names (#11566) @jlowe
- Reduce code duplication for dask & distributed nightly/stable installs (#11565) @galipremsagar
- Fix groupby failures in dask_cudf CI (#11561) @rjzamora
- Fix for pivot: error when 'values' is a multicharacter string (#11538) @shaswat-indian
- find_package(cudf) + arrow9 usable with cudf build directory (#11535) @robertmaynard
- Fixing crash when writing binary nested data in parquet (#11526) @hyperbolic2346
- Fix for: error when assigning a value to an empty series (#11523) @shaswat-indian
- Fix invalid results from conditional-left-anti-join in debug build (#11517) @davidwendt
- Fix cmake error after upgrading to Arrow 9 (#11513) @ttnghia
- Fix reverse binary operators acting on a host value and cudf.Scalar (#11512) @bdice
- Update parquet fuzz tests to drop support for skiprows & num_rows (#11505) @galipremsagar
- Use rapids-cmake 22.10 best practice for RAPIDS.cmake location (#11493) @robertmaynard
- Handle some zero-sized corner cases in dlpack interop (#11449) @wence-
- Return empty dataframe when reading an ORC file using empty columns option (#11446) @vuule
- libcudf c++ example updated to CPM version 0.35.3 (#11417) @robertmaynard
- Fix regex quantifier check to include capture groups (#11373) @davidwendt
- Fix read_text when byte_range is aligned with field (#11371) @upsj
- Fix to_timestamps truncated subsecond calculation (#11367) @davidwendt
- column: calculate null_count before release()ing the cudf::column (#11365) @wence-
📖 Documentation
- Update guide-to-udfs notebook (#11861) @brandon-b-miller
- Update docstring for cudf.read_text (#11799) @GregoryKimball
- Add doc section for list & struct handling (#11770) @galipremsagar
- Document that minimum required CMake version is now 3.23.1 (#11751) @robertmaynard
- Update libcudf documentation build command in DOCUMENTATION.md (#11735) @davidwendt
- Add docs for use of string data to DataFrame.apply and Series.apply and update guide to UDFs notebook (#11733) @brandon-b-miller
- Enable more Pydocstyle rules (#11582) @bdice
- Remove unused cpp/img folder (#11554) @davidwendt
- Publish C++ developer docs (#11475) @vyasr
- Fix a misalignment in cudf.get_dummies docstring (#11443) @galipremsagar
- Update contributing doc to include links to the developer guides (#11390) @davidwendt
- Fix table_view_base doxygen format (#11340) @davidwendt
- Create main developer guide for Python (#11235) @vyasr
- Add developer documentation for benchmarking (#11122) @vyasr
- cuDF error handling document (#7917) @isVoid
🚀 New Features
- Add hasNull statistic reading ability to ORC (#11747) @devavret
- Add istitle to string UDFs (#11738) @brandon-b-miller
- JSON Column creation in GPU (#11714) @karthikeyann
- Adds option to take explicit nested schema for nested JSON reader (#11682) @elstehle
- Add BGZIP data_chunk_reader (#11652) @upsj
- Support DECIMAL order-by for RANGE window functions (#11645) @mythrocks
- changing version of cmake to 3.23.3 (#11619) @hyperbolic2346
- Generate unique keys table in java JNI contiguousSplitGroups (#11614) @res-life
- Generic type casting to support the new nested JSON reader (#11613) @elstehle
- JSON tree traversal (#11610) @karthikeyann
- Add casting operators to masked UDFs (#11578) @brandon-b-miller
- Adds type inference and type conversion for leaf-columns to the nested JSON parser (#11574) @elstehle
- Add strings 'like' function (#11558) @davidwendt
- Handle hyphen as literal for regex cclass when incomplete range (#11557) @davidwendt
- Enable ZSTD compression in ORC and Parquet writers (#11551) @vuule
- Adds support for json lines format to the nested JSON reader (#11534) @elstehle
- Adding optional parquet reader schema (#11524) @hyperbolic2346
- Adds GPU implementation of JSON-token-stream to JSON-tree (#11518) @karthikeyann
- Add gdb pretty-printers for simple types (#11499) @upsj
- Add create_random_column function to the data generator (#11490) @vuule
- Add fluent API builder to data_profile (#11479) @vuule
- Adds Nested Json benchmark (#11466) @karthikeyann
- Convert thrust::optional usages to std::optional (#11455) @robertmaynard
- Python API for the future experimental JSON reader (#11426) @vuule
- Return schema info from JSON reader (#11419) @vuule
- Add regex ASCII flag support for matching builtin character classes (#11404) @davidwendt
- Truncate parquet column indexes (#11403) @etseidl
- Adds the end-to-end JSON parser implementation (#11388) @elstehle
- Use the new JSON parser when the experimental reader is selected (#11364) @vuule
- Add placeholder for the experimental JSON reader (#11334) @vuule
- Add read-only functions on string dtypes to DataFrame.apply and Series.apply (#11319) @brandon-b-miller
- Added 'crosstab' and 'pivot_table' features (#11314) @shaswat-indian
- Quickly error out when trying to build with unsupported nvcc versions (#11297) @robertmaynard
- Adds JSON tokenizer (#11264) @elstehle
- List lexicographic comparator (#11129) @devavret
- Add generic type inference for cuIO (#11121) @PointKernel
- Fully support nested types in cudf::contains (#10656) @ttnghia
- Support nested types in lists::contains (#10548) @ttnghia
π οΈ Improvements
- Pin
daskanddistributedfor release (#11822) @galipremsagar - Add examples for Nested JSON reader (#11814) @GregoryKimball
- Support shuffle-based groupby aggregations in dask_cudf (#11800) @rjzamora
- Update strings udf version updater script (#11772) @galipremsagar
- Remove
kwargsinread_csv&to_csv(#11762) @galipremsagar - Pass
dtypeparam to avoidpd.Serieswarnings (#11761) @galipremsagar - Enable
schema_element&keep_quotessupport in json reader (#11746) @galipremsagar - Add ability to construct
ListColumnwhen size isNone(#11745) @galipremsagar - Reduces memory requirements in JSON parser and adds bytes/s and peak memory usage to benchmarks (#11732) @elstehle
- Add missing copyright headers. (#11712) @bdice
- Fix copyright check issues in pre-commit (#11711) @bdice
- Include decimal in supported types for range window order-by columns (#11710) @mythrocks
- Disable very large column gtest for contiguous-split (#11706) @davidwendt
- Drop split_out=None test from groupby.agg (#11704) @wence-
- Use CubinLinker for CUDA Minor Version Compatibility (#11701) @gmarkall
- Add regex capture-group parameter to auto convert to non-capture groups (#11695) @davidwendt
- Add a
__dataframe__method to the protocol dataframe object (#11692) @rgommers - Special-case multibyte_split for single-byte delimiter (#11681) @upsj
- Remove isort exclusions (#11680) @bdice
- Refactor CSV reader benchmarks with nvbench (#11678) @PointKernel
- Check conda recipe headers with pre-commit (#11669) @bdice
- Remove redundant style check for clang-format. (#11668) @bdice
- Add support for
group_keysingroupby(#11659) @galipremsagar - Fix pandoc pinning. (#11658) @bdice
- Revert removal of skiprows / numrows options from the Parquet reader. (#11657) @nvdbaranec
- Update git metadata (#11647) @bdice
- Call setnullcount on a returning column if null-count is known (#11646) @davidwendt
- Fix some libcudf detail calls not passing the stream variable (#11642) @davidwendt
- Update to mypy 0.971 (#11640) @wence-
- Refactor strings strip functor to details header (#11635) @davidwendt
- Fix incorrect
nullCountinget_json_object(#11633) @trxcllnt - Simplify
hostdevice_vector(#11631) @upsj - Refactor parquet writer benchmarks with nvbench (#11623) @PointKernel
- Rework contains_scalar to check nulls at runtime (#11622) @davidwendt
- Fix incorrect memory resource used in rolling temp columns (#11618) @mythrocks
- Upgrade
pandasto1.5(#11617) @galipremsagar - Move type-dispatcher calls from traits.hpp to traits.cpp (#11616) @davidwendt
- Refactor parquet reader benchmarks with nvbench (#11611) @PointKernel
- Forward-merge branch-22.08 to branch-22.10 (#11608) @bdice
- Use stream in Java API. (#11601) @bdice
- Refactors of public/detail APIs, CUDFFUNCRANGE, stream handling. (#11600) @bdice
- Improve ORC writer benchmark with nvbench (#11598) @PointKernel
- Tune multibyte_split kernel (#11587) @upsj
- Move split_utils.cuh to strings/detail (#11585) @davidwendt
- Fix warnings due to compiler regression with
if constexpr(#11581) @ttnghia - Add full 24-bit dictionary support to Parquet writer (#11580) @etseidl
- Expose "explicit-comms" option in shuffle-based dask_cudf functions (#11576) @rjzamora
- Move cudf::strings::findall_record to cudf::strings::findall (#11575) @davidwendt
- Refactor daskcudf groupby to use applyconcat_apply (#11571) @rjzamora
- Add ability to write
list(struct)columns asmaptype in orc writer (#11568) @galipremsagar - Add byterange to multibytesplit benchmark + NVBench refactor (#11562) @upsj
- JNI support for writing binary columns in parquet (#11556) @revans2
- Support additional dictionary bit widths in Parquet writer (#11547) @etseidl
- Refactor string/numeric conversion utilities (#11545) @davidwendt
- Removing unnecessary asserts in parquet tests (#11544) @hyperbolic2346
- Clean up ORC reader benchmarks with NVBench (#11543) @PointKernel
- Reuse MurmurHash3_32 in Parquet page data. (#11528) @bdice
- Add hexadecimal value separators (#11527) @bdice
- Deprecate
skiprowsandnum_rowsinread_orc(#11522) @galipremsagar - Struct support for
NULL_EQUALSbinary operation (#11520) @rwlee - Bump hadoop-common from 3.2.3 to 3.2.4 in /java (#11516) @dependabot[bot]
- Fix Feather test warning. (#11511) @bdice
- copyrange ballotsyncs to have no execution dependency (#11508) @robertmaynard
- Upgrade to
arrow-9.x(#11507) @galipremsagar - Remove support for skiprows / numrows options in the parquet reader. (#11503) @nvdbaranec
- Single-pass
multibyte_split(#11500) @upsj - Sanitize percentile_approx() output for empty input (#11498) @SrikarVanavasam
- Unpin
daskanddistributedfor development (#11492) @galipremsagar - Move SparkMurmurHash3_32 functor. (#11489) @bdice
- Refactor group_nunique.cu to use nullate::DYNAMIC for reduce-by-key functor (#11482) @davidwendt
- Drop support for
skiprowsandnum_rowsincudf.read_parquet(#11480) @galipremsagar - Add reduction
distinct_countbenchmark (#11473) @ttnghia - Add groupby
nuniqueaggregation benchmark (#11472) @ttnghia - Disable Arrow S3 support by default. (#11470) @bdice
- Add groupby
maxaggregation benchmark (#11464) @ttnghia - Extract Dremel encoding code from Parquet (#11461) @vyasr
- Add missing Thrust #includes. (#11457) @bdice
- Make CMake hooks verbose (#11456) @vyasr
- Control Parquet page size through Python API (#11454) @etseidl
- Add control of Parquet column index creation to python (#11453) @etseidl
- Remove unused is_struct trait. (#11450) @bdice
- Refactor the `Buffer` class (#11447) @madsbk
- Refactor pad_side and strip_type enums into side_type enum (#11438) @davidwendt
- Update to Thrust 1.17.0 (#11437) @bdice
- Add in JNI for parsing JSON data and getting the metadata back too. (#11431) @revans2
- Convert byte_array_view to use std::byte (#11424) @hyperbolic2346
- Deprecate unflatten_nested_columns (#11421) @SrikarVanavasam
- Remove HASH_SERIAL_MURMUR3 / serial32BitMurmurHash3 (#11383) @bdice
- Add Spark list hashing Java tests (#11379) @bdice
- Move cmake to the build section. (#11376) @vyasr
- Remove use of CUDA driver API calls from libcudf (#11370) @shwina
- Add column constructor from device_uvector&& (#11356) @SrikarVanavasam
- Remove unused custreamz thirdparty directory (#11343) @vyasr
- Update jni version to 22.10.0-SNAPSHOT (#11338) @pxLi
- Enable using upstream jitify2 (#11287) @shwina
- Cache cudf.Scalar (#11246) @shwina
- Remove deprecated Series.applymap. (#11031) @bdice
- Remove deprecated expand parameter from str.findall. (#11030) @bdice
Published by GPUtester over 3 years ago
https://github.com/rapidsai/cudf - v22.10.00
🚨 Breaking Changes
- Disable Zstandard decompression on nvCOMP 2.4 and Pascal GPUs (#11856) @vuule
- Disable nvCOMP DEFLATE integration (#11811) @vuule
- Fix return type of `Index.isna` & `Index.notna` (#11769) @galipremsagar
- Remove `kwargs` in `read_csv` & `to_csv` (#11762) @galipremsagar
- Fix `cudf::partition*` APIs that do not return offsets for empty output table (#11709) @ttnghia
- Fix regex negated classes to not automatically include new-lines (#11644) @davidwendt
- Update zfill to match Python output (#11634) @davidwendt
- Upgrade `pandas` to `1.5` (#11617) @galipremsagar
- Change default value of `ordered` to `False` in `CategoricalDtype` (#11604) @galipremsagar
- Move cudf::strings::findall_record to cudf::strings::findall (#11575) @davidwendt
- Adding optional parquet reader schema (#11524) @hyperbolic2346
- Deprecate `skiprows` and `num_rows` in `read_orc` (#11522) @galipremsagar
- Remove support for skip_rows / num_rows options in the parquet reader. (#11503) @nvdbaranec
- Drop support for `skiprows` and `num_rows` in `cudf.read_parquet` (#11480) @galipremsagar
- Disable Arrow S3 support by default. (#11470) @bdice
- Convert thrust::optional usages to std::optional (#11455) @robertmaynard
- Remove unused is_struct trait. (#11450) @bdice
- Refactor the `Buffer` class (#11447) @madsbk
- Return empty dataframe when reading an ORC file using empty `columns` option (#11446) @vuule
- Refactor pad_side and strip_type enums into side_type enum (#11438) @davidwendt
- Remove HASH_SERIAL_MURMUR3 / serial32BitMurmurHash3 (#11383) @bdice
- Use the new JSON parser when the experimental reader is selected (#11364) @vuule
- Remove deprecated Series.applymap. (#11031) @bdice
- Remove deprecated expand parameter from str.findall. (#11030) @bdice
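The zfill entry in this list (#11634) says the output now matches Python. As a reference point only, the sketch below shows what Python's built-in `str.zfill` does with signed and non-numeric strings. It uses plain Python and makes no cuDF calls, so it illustrates the target behavior rather than the library's implementation.

```python
# Reference behavior of Python's built-in str.zfill.
# Plain Python only; this does not call cuDF.
samples = ["42", "-42", "+42", "abc", "12345"]

# Pad every sample to a width of 5 characters.
padded = [s.zfill(5) for s in samples]

for original, result in zip(samples, padded):
    # A leading '+' or '-' stays in front and the zeros go after it;
    # strings already at or beyond the width are returned unchanged.
    print(f"{original!r:>8} -> {result!r}")
```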
🐛 Bug Fixes
- Fixes bug in temporary decompression space estimation before calling nvcomp (#11879) @abellina
- Handle `ptx` file paths during `strings_udf` import (#11862) @galipremsagar
- Disable Zstandard decompression on nvCOMP 2.4 and Pascal GPUs (#11856) @vuule
- Reset `strings_udf` CEC and solve several related issues (#11846) @brandon-b-miller
- Fix bug in new shuffle-based groupby implementation (#11836) @rjzamora
- Fix `is_valid` checks in `Scalar._binaryop` (#11818) @wence-
- Fix operator `NotImplemented` issue with `numpy` (#11816) @galipremsagar
- Disable nvCOMP DEFLATE integration (#11811) @vuule
- Build `strings_udf` package with other python packages in nightlies (#11808) @brandon-b-miller
- Revert problematic shuffle=explicit-comms changes (#11803) @rjzamora
- Fix regex out-of-bounds write in strided rows logic (#11797) @davidwendt
- Build `cudf` locally before building `strings_udf` conda packages in CI (#11785) @brandon-b-miller
- Fix an issue in cudf::row_bit_count involving structs and lists at multiple levels. (#11779) @nvdbaranec
- Fix return type of `Index.isna` & `Index.notna` (#11769) @galipremsagar
- Fix issue with set-item in case of `list` and `struct` types (#11760) @galipremsagar
- Ensure all libcudf APIs run on cudf's default stream (#11759) @vyasr
- Resolve dask_cudf failures caused by upstream groupby changes (#11755) @rjzamora
- Fix ORC string sum statistics (#11740) @vuule
- Add `strings_udf` package for python 3.9 (#11730) @brandon-b-miller
- Ensure that all tests launch kernels on cudf's default stream (#11726) @vyasr
- Don't assume stream is a compile-time constant expression (#11725) @vyasr
- Fix get_thrust.cmake format at patch command (#11715) @davidwendt
- Fix `cudf::partition*` APIs that do not return offsets for empty output table (#11709) @ttnghia
- Fix cudf::lists::sort_lists for NaN and Infinity values (#11703) @davidwendt
- Modify ORC reader timestamp parsing to match the apache reader behavior (#11699) @vuule
- Fix `DataFrame.from_arrow` to preserve type metadata (#11698) @galipremsagar
- Fix compile error due to missing header (#11697) @ttnghia
- Default to Snappy compression in `to_orc` when using cuDF or Dask (#11690) @vuule
- Fix an issue related to `MultiIndex` when `group_keys=True` (#11689) @galipremsagar
- Transfer correct dtype to exploded column (#11687) @wence-
- Ignore protobuf generated files in `mypy` checks (#11685) @galipremsagar
- Maintain the index name after `.loc` (#11677) @shwina
- Fix issue with extracting nested column data & dtype preservation (#11671) @galipremsagar
- Ensure that all cudf tests and benchmarks are conda env aware (#11666) @robertmaynard
- Update to Thrust 1.17.2 to fix cub ODR issues (#11665) @robertmaynard
- Fix multi-file remote datasource bug (#11655) @rjzamora
- Fix invalid regex quantifier check to not include alternation (#11654) @davidwendt
- Fix bug in `device_write()`: it uses an incorrect size (#11651) @madsbk
- fixes overflows in benchmarks (#11649) @elstehle
- Fix regex negated classes to not automatically include new-lines (#11644) @davidwendt
- Fix compile error in benchmark nested_json.cpp (#11637) @davidwendt
- Update zfill to match Python output (#11634) @davidwendt
- Removed converted type for INT32 and INT64 since they do not convert (#11627) @hyperbolic2346
- Fix host scalars construction of nested types (#11612) @galipremsagar
- Fix compile warning in nested_json_gpu.cu (#11607) @davidwendt
- Change default value of `ordered` to `False` in `CategoricalDtype` (#11604) @galipremsagar
- Preserve order if necessary when deduping categoricals internally (#11597) @brandon-b-miller
- Add is_timestamp test for leap second (60) (#11594) @davidwendt
- Fix an issue with `to_arrow` when column name type is not a string (#11590) @galipremsagar
- Fix exception in segmented-reduce benchmark (#11588) @davidwendt
- Fix encode/decode of negative timestamps in ORC reader/writer (#11586) @vuule
- Correct distribution data type in `quantiles` benchmark (#11584) @vuule
- Fix multibyte_split benchmark for host buffers (#11583) @upsj
- xfail custreamz display test for now (#11567) @shwina
- Fix JNI for TableWithMeta to use schema_info instead of column_names (#11566) @jlowe
- Reduce code duplication for `dask` & `distributed` nightly/stable installs (#11565) @galipremsagar
- Fix groupby failures in dask_cudf CI (#11561) @rjzamora
- Fix for pivot: error when 'values' is a multicharacter string (#11538) @shaswat-indian
- find_package(cudf) + arrow9 usable with cudf build directory (#11535) @robertmaynard
- Fixing crash when writing binary nested data in parquet (#11526) @hyperbolic2346
- Fix for: error when assigning a value to an empty series (#11523) @shaswat-indian
- Fix invalid results from conditional-left-anti-join in debug build (#11517) @davidwendt
- Fix cmake error after upgrading to Arrow 9 (#11513) @ttnghia
- Fix reverse binary operators acting on a host value and cudf.Scalar (#11512) @bdice
- Update parquet fuzz tests to drop support for `skiprows` & `num_rows` (#11505) @galipremsagar
- Use rapids-cmake 22.10 best practice for RAPIDS.cmake location (#11493) @robertmaynard
- Handle some zero-sized corner cases in dlpack interop (#11449) @wence-
- Return empty dataframe when reading an ORC file using empty `columns` option (#11446) @vuule
- libcudf c++ example updated to CPM version 0.35.3 (#11417) @robertmaynard
- Fix regex quantifier check to include capture groups (#11373) @davidwendt
- Fix read_text when byte_range is aligned with field (#11371) @upsj
- Fix to_timestamps truncated subsecond calculation (#11367) @davidwendt
- column: calculate null_count before release()ing the cudf::column (#11365) @wence-
📖 Documentation
- Update `guide-to-udfs` notebook (#11861) @brandon-b-miller
- Update docstring for cudf.read_text (#11799) @GregoryKimball
- Add doc section for `list` & `struct` handling (#11770) @galipremsagar
- Document that minimum required CMake version is now 3.23.1 (#11751) @robertmaynard
- Update libcudf documentation build command in DOCUMENTATION.md (#11735) @davidwendt
- Add docs for use of string data to `DataFrame.apply` and `Series.apply` and update guide to UDFs notebook (#11733) @brandon-b-miller
- Enable more Pydocstyle rules (#11582) @bdice
- Remove unused cpp/img folder (#11554) @davidwendt
- Publish C++ developer docs (#11475) @vyasr
- Fix a misalignment in `cudf.get_dummies` docstring (#11443) @galipremsagar
- Update contributing doc to include links to the developer guides (#11390) @davidwendt
- Fix table_view_base doxygen format (#11340) @davidwendt
- Create main developer guide for Python (#11235) @vyasr
- Add developer documentation for benchmarking (#11122) @vyasr
- cuDF error handling document (#7917) @isVoid
🚀 New Features
- Add hasNull statistic reading ability to ORC (#11747) @devavret
- Add `istitle` to string UDFs (#11738) @brandon-b-miller
- JSON Column creation in GPU (#11714) @karthikeyann
- Adds option to take explicit nested schema for nested JSON reader (#11682) @elstehle
- Add BGZIP `data_chunk_reader` (#11652) @upsj
- Support DECIMAL order-by for RANGE window functions (#11645) @mythrocks
- changing version of cmake to 3.23.3 (#11619) @hyperbolic2346
- Generate unique keys table in java JNI `contiguousSplitGroups` (#11614) @res-life
- Generic type casting to support the new nested JSON reader (#11613) @elstehle
- JSON tree traversal (#11610) @karthikeyann
- Add casting operators to masked UDFs (#11578) @brandon-b-miller
- Adds type inference and type conversion for leaf-columns to the nested JSON parser (#11574) @elstehle
- Add strings 'like' function (#11558) @davidwendt
- Handle hyphen as literal for regex cclass when incomplete range (#11557) @davidwendt
- Enable ZSTD compression in ORC and Parquet writers (#11551) @vuule
- Adds support for json lines format to the nested JSON reader (#11534) @elstehle
- Adding optional parquet reader schema (#11524) @hyperbolic2346
- Adds GPU implementation of JSON-token-stream to JSON-tree (#11518) @karthikeyann
- Add `gdb` pretty-printers for simple types (#11499) @upsj
- Add `create_random_column` function to the data generator (#11490) @vuule
- Add fluent API builder to `data_profile` (#11479) @vuule
- Adds Nested Json benchmark (#11466) @karthikeyann
- Convert thrust::optional usages to std::optional (#11455) @robertmaynard
- Python API for the future experimental JSON reader (#11426) @vuule
- Return schema info from JSON reader (#11419) @vuule
- Add regex ASCII flag support for matching builtin character classes (#11404) @davidwendt
- Truncate parquet column indexes (#11403) @etseidl
- Adds the end-to-end JSON parser implementation (#11388) @elstehle
- Use the new JSON parser when the experimental reader is selected (#11364) @vuule
- Add placeholder for the experimental JSON reader (#11334) @vuule
- Add read-only functions on string dtypes to `DataFrame.apply` and `Series.apply` (#11319) @brandon-b-miller
- Added 'crosstab' and 'pivot_table' features (#11314) @shaswat-indian
- Quickly error out when trying to build with unsupported nvcc versions (#11297) @robertmaynard
- Adds JSON tokenizer (#11264) @elstehle
- List lexicographic comparator (#11129) @devavret
- Add generic type inference for cuIO (#11121) @PointKernel
- Fully support nested types in `cudf::contains` (#10656) @ttnghia
- Support nested types in `lists::contains` (#10548) @ttnghia
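Several entries in this list concern the nested JSON reader, including support for the JSON Lines format (#11534). For readers unfamiliar with that format, here is a minimal sketch using only Python's standard `json` module. The helper name `parse_json_lines` and the sample data are made up for illustration, and nothing here reflects how the cuDF reader is implemented.

```python
import json

# JSON Lines: one complete JSON value per line, with no enclosing array.
# The sample records are hypothetical and include nested objects and lists.
jsonl_text = '{"a": 1, "b": {"c": [1, 2]}}\n{"a": 2, "b": {"c": []}}\n'


def parse_json_lines(text):
    """Parse each non-empty line as an independent JSON document."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = parse_json_lines(jsonl_text)
for record in records:
    print(record["a"], record["b"]["c"])
```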
🛠️ Improvements
- Pin `dask` and `distributed` for release (#11822) @galipremsagar
- Add examples for Nested JSON reader (#11814) @GregoryKimball
- Support shuffle-based groupby aggregations in dask_cudf (#11800) @rjzamora
- Update strings udf version updater script (#11772) @galipremsagar
- Remove `kwargs` in `read_csv` & `to_csv` (#11762) @galipremsagar
- Pass `dtype` param to avoid `pd.Series` warnings (#11761) @galipremsagar
- Enable `schema_element` & `keep_quotes` support in json reader (#11746) @galipremsagar
- Add ability to construct `ListColumn` when size is `None` (#11745) @galipremsagar
- Reduces memory requirements in JSON parser and adds bytes/s and peak memory usage to benchmarks (#11732) @elstehle
- Add missing copyright headers. (#11712) @bdice
- Fix copyright check issues in pre-commit (#11711) @bdice
- Include decimal in supported types for range window order-by columns (#11710) @mythrocks
- Disable very large column gtest for contiguous-split (#11706) @davidwendt
- Drop split_out=None test from groupby.agg (#11704) @wence-
- Use CubinLinker for CUDA Minor Version Compatibility (#11701) @gmarkall
- Add regex capture-group parameter to auto convert to non-capture groups (#11695) @davidwendt
- Add a `__dataframe__` method to the protocol dataframe object (#11692) @rgommers
- Special-case multibyte_split for single-byte delimiter (#11681) @upsj
- Remove isort exclusions (#11680) @bdice
- Refactor CSV reader benchmarks with nvbench (#11678) @PointKernel
- Check conda recipe headers with pre-commit (#11669) @bdice
- Remove redundant style check for clang-format. (#11668) @bdice
- Add support for `group_keys` in `groupby` (#11659) @galipremsagar
- Fix pandoc pinning. (#11658) @bdice
- Revert removal of skip_rows / num_rows options from the Parquet reader. (#11657) @nvdbaranec
- Update git metadata (#11647) @bdice
- Call set_null_count on a returning column if null-count is known (#11646) @davidwendt
- Fix some libcudf detail calls not passing the stream variable (#11642) @davidwendt
- Update to mypy 0.971 (#11640) @wence-
- Refactor strings strip functor to details header (#11635) @davidwendt
- Fix incorrect `nullCount` in `get_json_object` (#11633) @trxcllnt
- Simplify `hostdevice_vector` (#11631) @upsj
- Refactor parquet writer benchmarks with nvbench (#11623) @PointKernel
- Rework contains_scalar to check nulls at runtime (#11622) @davidwendt
- Fix incorrect memory resource used in rolling temp columns (#11618) @mythrocks
- Upgrade `pandas` to `1.5` (#11617) @galipremsagar
- Move type-dispatcher calls from traits.hpp to traits.cpp (#11616) @davidwendt
- Refactor parquet reader benchmarks with nvbench (#11611) @PointKernel
- Forward-merge branch-22.08 to branch-22.10 (#11608) @bdice
- Use stream in Java API. (#11601) @bdice
- Refactors of public/detail APIs, CUDF_FUNC_RANGE, stream handling. (#11600) @bdice
- Improve ORC writer benchmark with nvbench (#11598) @PointKernel
- Tune multibyte_split kernel (#11587) @upsj
- Move split_utils.cuh to strings/detail (#11585) @davidwendt
- Fix warnings due to compiler regression with `if constexpr` (#11581) @ttnghia
- Add full 24-bit dictionary support to Parquet writer (#11580) @etseidl
- Expose "explicit-comms" option in shuffle-based dask_cudf functions (#11576) @rjzamora
- Move cudf::strings::findall_record to cudf::strings::findall (#11575) @davidwendt
- Refactor dask_cudf groupby to use apply_concat_apply (#11571) @rjzamora
- Add ability to write `list(struct)` columns as `map` type in orc writer (#11568) @galipremsagar
- Add byte_range to multibyte_split benchmark + NVBench refactor (#11562) @upsj
- JNI support for writing binary columns in parquet (#11556) @revans2
- Support additional dictionary bit widths in Parquet writer (#11547) @etseidl
- Refactor string/numeric conversion utilities (#11545) @davidwendt
- Removing unnecessary asserts in parquet tests (#11544) @hyperbolic2346
- Clean up ORC reader benchmarks with NVBench (#11543) @PointKernel
- Reuse MurmurHash3_32 in Parquet page data. (#11528) @bdice
- Add hexadecimal value separators (#11527) @bdice
- Deprecate `skiprows` and `num_rows` in `read_orc` (#11522) @galipremsagar
- Struct support for `NULL_EQUALS` binary operation (#11520) @rwlee
- Bump hadoop-common from 3.2.3 to 3.2.4 in /java (#11516) @dependabot[bot]
- Fix Feather test warning. (#11511) @bdice
- copy_range ballot_syncs to have no execution dependency (#11508) @robertmaynard
- Upgrade to `arrow-9.x` (#11507) @galipremsagar
- Remove support for skip_rows / num_rows options in the parquet reader. (#11503) @nvdbaranec
- Single-pass `multibyte_split` (#11500) @upsj
- Sanitize percentile_approx() output for empty input (#11498) @SrikarVanavasam
- Unpin `dask` and `distributed` for development (#11492) @galipremsagar
- Move SparkMurmurHash3_32 functor. (#11489) @bdice
- Refactor group_nunique.cu to use nullate::DYNAMIC for reduce-by-key functor (#11482) @davidwendt
- Drop support for `skiprows` and `num_rows` in `cudf.read_parquet` (#11480) @galipremsagar
- Add reduction `distinct_count` benchmark (#11473) @ttnghia
- Add groupby `nunique` aggregation benchmark (#11472) @ttnghia
- Disable Arrow S3 support by default. (#11470) @bdice
- Add groupby `max` aggregation benchmark (#11464) @ttnghia
- Extract Dremel encoding code from Parquet (#11461) @vyasr
- Add missing Thrust #includes. (#11457) @bdice
- Make CMake hooks verbose (#11456) @vyasr
- Control Parquet page size through Python API (#11454) @etseidl
- Add control of Parquet column index creation to python (#11453) @etseidl
- Remove unused is_struct trait. (#11450) @bdice
- Refactor the `Buffer` class (#11447) @madsbk
- Refactor pad_side and strip_type enums into side_type enum (#11438) @davidwendt
- Update to Thrust 1.17.0 (#11437) @bdice
- Add in JNI for parsing JSON data and getting the metadata back too. (#11431) @revans2
- Convert byte_array_view to use std::byte (#11424) @hyperbolic2346
- Deprecate unflatten_nested_columns (#11421) @SrikarVanavasam
- Remove HASH_SERIAL_MURMUR3 / serial32BitMurmurHash3 (#11383) @bdice
- Add Spark list hashing Java tests (#11379) @bdice
- Move cmake to the build section. (#11376) @vyasr
- Remove use of CUDA driver API calls from libcudf (#11370) @shwina
- Add column constructor from device_uvector&& (#11356) @SrikarVanavasam
- Remove unused custreamz thirdparty directory (#11343) @vyasr
- Update jni version to 22.10.0-SNAPSHOT (#11338) @pxLi
- Enable using upstream jitify2 (#11287) @shwina
- Cache cudf.Scalar (#11246) @shwina
- Remove deprecated Series.applymap. (#11031) @bdice
- Remove deprecated expand parameter from str.findall. (#11030) @bdice
Published by GPUtester over 3 years ago
https://github.com/rapidsai/cudf - v22.08.01
🚨 Breaking Changes
- Pin `numpy` to `<1.23` (#11824) @galipremsagar
- Remove legacy join APIs (#11274) @vyasr
- Remove `lists::drop_list_duplicates` (#11236) @ttnghia
- Remove Index.replace API (#11131) @vyasr
- Remove deprecated Index methods from Frame (#11073) @vyasr
- Remove public API of cudf.merge_sorted. (#11032) @bdice
- Drop python `3.7` in code-base (#11029) @galipremsagar
- Return empty dataframe when reading a Parquet file using empty `columns` option (#11018) @vuule
- Remove Arrow CUDA IPC code (#10995) @shwina
- Buffer: make `.ptr` read-only (#10872) @madsbk
🐛 Bug Fixes
- Fix out-of-bound access in `cudf::detail::label_segments` (#11497) @ttnghia
- Fix `distributed` error related to `loop_in_thread` (#11428) @galipremsagar
- Fix atomic operations on NaN values (#11420) @ttnghia
- Relax arrow pinning to just 8.x and remove cuda build dependency from cudf recipe (#11412) @kkraus14
- Revert "Allow CuPy 11" (#11409) @jakirkham
- Fix `moto` timeouts (#11369) @galipremsagar
- Set `+/-infinity` as the `identity` values for floating-point numbers in device operators `min` and `max` (#11357) @ttnghia
- Fix memory_usage() for `ListSeries` (#11355) @thomcom
- Fix constructing Column from column_view with expired mask (#11354) @shwina
- Handle parquet corner case: Columns with more rows than are in the row group. (#11353) @nvdbaranec
- Fix `DatetimeIndex` & `TimedeltaIndex` constructors (#11342) @galipremsagar
- Fix unsigned-compare compile warning in IntPow binops (#11339) @davidwendt
- Fix performance issue and add a new code path to `cudf::detail::contains` (#11330) @ttnghia
- Pin `pytorch` to temporarily unblock from `libcupti` errors (#11289) @galipremsagar
- Workaround for nvcomp zstd overwriting blocks for orc due to underestimate of sizes (#11288) @jbrennan333
- Fix inconsistency when hashing two tables in `cudf::detail::contains` (#11284) @ttnghia
- Fix issue related to numpy array and `category` dtype (#11282) @galipremsagar
- Add NotImplementedError when on is specified in DataFrame.join. (#11275) @vyasr
- Fix invalid allocate_like() and empty_like() tests. (#11268) @nvdbaranec
- Returns DataFrame When Concating Along Axis 1 (#11263) @isVoid
- Fix compile error due to missing header (#11257) @ttnghia
- Fix a memory aliasing/crash issue in scatter for lists. (#11254) @nvdbaranec
- Fix `tests/rolling/empty_input_test` (#11238) @ttnghia
- Fix const qualifier when using `host_span<bitmask_type const*>` (#11220) @ttnghia
- Avoid using `nvcompBatchedDeflateDecompressGetTempSizeEx` in cuIO (#11213) @vuule
- Generate benchmark data with correct run length regardless of cardinality (#11205) @vuule
- Fix cumulative count index behavior (#11188) @brandon-b-miller
- Fix assertion in dask_cudf test_struct_explode (#11170) @rjzamora
- Provides a method for the user to remove the hook and re-register the hook in a custom shutdown hook manager (#11161) @res-life
- Fix compatibility issues with pandas 1.4.3 (#11152) @vyasr
- Ensure cuco export set is installed in cmake build (#11147) @jlowe
- Avoid redundant deepcopy in `cudf.from_pandas` (#11142) @galipremsagar
- Fix compile error due to missing header (#11126) @ttnghia
- Fix `__cuda_array_interface__` failures (#11113) @galipremsagar
- Support octal and hex within regex character class pattern (#11112) @davidwendt
- Fix split_re matching logic for word boundaries (#11106) @davidwendt
- Handle multiple files metadata in `read_parquet` (#11105) @galipremsagar
- Fix index alignment for Series objects with repeated index (#11103) @shwina
- FindcuFile now searches in the current CUDA Toolkit location (#11101) @robertmaynard
- Fix regex word boundary logic to include underline (#11099) @davidwendt
- Exclude CudaFatalTest when selecting all Java tests (#11083) @jlowe
- Fix duplicate `cudatoolkit` pinning issue (#11070) @galipremsagar
- Maintain the input index in the result of a groupby-transform (#11068) @shwina
- Fix bug with row count comparison for expect_columns_equivalent(). (#11059) @nvdbaranec
- Fix BPE uninitialized size value for null and empty input strings (#11054) @davidwendt
- Include missing header for usage of `get_current_device_resource()` (#11047) @AtlantaPepsi
- Fix warn_unused_result error in parquet test (#11026) @karthikeyann
- Return empty dataframe when reading a Parquet file using empty `columns` option (#11018) @vuule
- Fix small error in page row count limiting (#10991) @etseidl
- Fix a row index entry error in ORC writer issue (#10989) @vuule
- Fix grouped covariance to require both values to be convertible to double. (#10891) @bdice
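One fix in this list (#11357) sets +/-infinity as the identity values for floating-point `min` and `max`. The sketch below illustrates, in plain Python, why the identity of a `min` reduction should be +infinity. The comparison against the largest finite float is an assumption chosen for illustration; the changelog does not say which value was used before the fix, and `reduce_min` is a hypothetical helper, not a cuDF function.

```python
import math
import sys
from functools import reduce


def reduce_min(values, identity):
    """Fold `min` over values, starting from the given identity element."""
    return reduce(min, values, identity)


# Input whose true minimum is +infinity.
all_inf = [math.inf, math.inf]

# With +infinity as the identity, the starting value never changes the result.
correct = reduce_min(all_inf, math.inf)

# With a finite sentinel (assumed here for contrast), the sentinel leaks
# into the result because it is smaller than every input element.
wrong = reduce_min(all_inf, sys.float_info.max)

print(correct, wrong)
```

The same argument applies to `max` with -infinity as the identity.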
📖 Documentation
- Defer loading of `custom.js` (#11465) @galipremsagar
- Fix issues with day & night modes in python docs (#11400) @galipremsagar
- Update missing data handling APIs in docs (#11345) @galipremsagar
- Add lists filtering APIs to doxygen group. (#11336) @bdice
- Remove unused import in README sample (#11318) @vyasr
- Note null behavior in `where` docs (#11276) @brandon-b-miller
- Update docstring for spans in `get_row_data_range` (#11271) @vyasr
- Update nvCOMP integration table (#11231) @vuule
- Add dev docs for documentation writing (#11217) @vyasr
- Documentation fix for concatenate (#11187) @dagardner-nv
- Fix unresolved links in markdown (#11173) @karthikeyann
- Fix cudf version in README.md install commands (#11164) @jvanstraten
- Switch `language` from `None` to `"en"` in docs build (#11133) @galipremsagar
- Remove docs mentioning scalar_view since no such class exists. (#11132) @bdice
- Add docstring entry for `DataFrame.value_counts` (#11039) @galipremsagar
- Add docs to rolling var, std, count. (#11035) @bdice
- Fix docs for Numba UDFs. (#11020) @bdice
- Replace column comparison utilities functions with macros (#11007) @karthikeyann
- Fix Doxygen warnings in multiple headers files (#11003) @karthikeyann
- Fix doxygen warnings in utilities/ headers (#10974) @karthikeyann
- Fix Doxygen warnings in table header files (#10964) @karthikeyann
- Fix Doxygen warnings in column header files (#10963) @karthikeyann
- Fix Doxygen warnings in strings / header files (#10937) @karthikeyann
- Generate Doxygen Tag File for Libcudf (#10932) @isVoid
- Fix doxygen warnings in structs, lists headers (#10923) @karthikeyann
- Fix doxygen warnings in fixed_point.hpp (#10922) @karthikeyann
- Fix doxygen warnings in ast/, rolling, tdigest/, wrappers/, dictionary/ headers (#10921) @karthikeyann
- fix doxygen warnings in cudf/io/types.hpp, other header files (#10913) @karthikeyann
- fix doxygen warnings in cudf/io/ avro, csv, json, orc, parquet header files (#10912) @karthikeyann
- Fix doxygen warnings in cudf/*.hpp (#10896) @karthikeyann
- Add missing documentation in aggregation.hpp (#10887) @karthikeyann
- Revise PR template. (#10774) @bdice
🚀 New Features
- Change cmake to allow controlling Arrow version via cmake variable (#11429) @kkraus14
- Adding support for list<int8> columns to be written as byte arrays in parquet (#11328) @hyperbolic2346
- Adding byte array view structure (#11322) @hyperbolic2346
- Adding byte_array statistics (#11303) @hyperbolic2346
- Add column indexes to Parquet writer (#11302) @etseidl
- Provide an Option for Default Integer and Floating Bitwidth (#11272) @isVoid
- FST benchmark (#11243) @karthikeyann
- Adds the Finite-State Transducer algorithm (#11242) @elstehle
- Refactor `collect_set` to use `cudf::distinct` and `cudf::lists::distinct` (#11228) @ttnghia
- Treat zstd as stable in nvcomp releases 2.3.2 and later (#11226) @jbrennan333
- Add 24 bit dictionary support to Parquet writer (#11216) @devavret
- Enable positive group indices for extractAllRecord on JNI (#11215) @anthony-chang
- JNI bindings for NTH_ELEMENT window aggregation (#11201) @mythrocks
- Add JNI bindings for extractAllRecord (#11196) @anthony-chang
- Add `cudf.options` (#11193) @isVoid
- Add thrift support for parquet column and offset indexes (#11178) @etseidl
- Adding binary read/write as options for parquet (#11160) @hyperbolic2346
- Support `nth_element` for window functions (#11158) @mythrocks
- Implement `lists::distinct` and `cudf::detail::stable_distinct` (#11149) @ttnghia
- Implement Groupby pct_change (#11144) @skirui-source
- Add JNI for set operations (#11143) @ttnghia
- Remove deprecated PER_THREAD_DEFAULT_STREAM (#11134) @jbrennan333
- Added a Java method to check the existence of a list of keys in a map (#11128) @razajafri
- Feature/python benchmarking (#11125) @vyasr
- Support `nan_equality` in `cudf::distinct` (#11118) @ttnghia
- Added JNI for getMapValueForKeys (#11104) @razajafri
- Refactor `semi_anti_join` (#11100) @ttnghia
- Replace remaining instances of rmm::cuda_stream_default with cudf::default_stream_value (#11082) @jbrennan333
- Adds the Logical Stack algorithm (#11078) @elstehle
- Add doxygen-check pre-commit hook (#11076) @karthikeyann
- Use new nvCOMP API to optimize the decompression temp memory size (#11064) @vuule
- Add Doxygen CI check (#11057) @karthikeyann
- Support `duplicate_keep_option` in `cudf::distinct` (#11052) @ttnghia
- Support set operations (#11043) @ttnghia
- Support for ZLIB compression in ORC writer (#11036) @vuule
- Adding feature swaplevels (#11027) @VamsiTallam95
- Use nvCOMP for ZLIB decompression in ORC reader (#11024) @vuule
- Function for bfill, ffill #9591 (#11022) @Sreekiran096
- Generate group offsets from element labels (#11017) @ttnghia
- Feature axes (#10979) @VamsiTallam95
- Generate group labels from offsets (#10945) @ttnghia
- Add missing cuIO benchmark coverage for duration types (#10933) @vuule
- Dask-cuDF cumulative groupby ops (#10889) @brandon-b-miller
- Reindex Improvements (#10815) @brandon-b-miller
- Implement value_counts for DataFrame (#10813) @martinfalisse
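Two entries in this list, "Generate group offsets from element labels" (#11017) and "Generate group labels from offsets" (#10945), name a pair of inverse conversions between two ways of describing segments. The sketch below shows one plausible reading of that pair in plain Python. The function names are hypothetical, the labels are assumed to be sorted ascending, and this is not the libcudf implementation.

```python
from bisect import bisect_left


def labels_from_offsets(offsets):
    """Expand segment offsets into one label per element.

    Group i owns the half-open element range [offsets[i], offsets[i + 1]).
    """
    labels = []
    for group in range(len(offsets) - 1):
        labels.extend([group] * (offsets[group + 1] - offsets[group]))
    return labels


def offsets_from_labels(labels, num_groups):
    """Inverse mapping; assumes labels are sorted ascending."""
    starts = [bisect_left(labels, group) for group in range(num_groups)]
    return starts + [len(labels)]


offsets = [0, 3, 3, 5]  # three groups; the middle group is empty
labels = labels_from_offsets(offsets)
round_trip = offsets_from_labels(labels, num_groups=3)
print(labels, round_trip)
```

Passing `num_groups` explicitly is a design choice in this sketch: labels alone cannot reveal empty groups that come after the last label.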
🛠️ Improvements
- Pin `numpy` to `<1.23` (#11824) @galipremsagar
- Make Index Join Tests on Default Precisions Deterministic (#11451) @isVoid
- Pin `dask` & `distributed` for release (#11433) @galipremsagar
- Use documented header template for `doxygen` (#11430) @galipremsagar
- Relax arrow version in dev env (#11418) @galipremsagar
- Added Java bindings for Parquet options for binary read (#11410) @razajafri
- Allow CuPy 11 (#11393) @jakirkham
- Improve multibyte_split performance (#11347) @cwharris
- Switch death test to use explicit trap. (#11326) @vyasr
- Add --output-on-failure to ctest args. (#11321) @vyasr
- Consolidate remaining DataFrame/Series APIs (#11315) @vyasr
- Add JNI support for the join_strings API (#11309) @revans2
- Add cupy version to setup.py install_requires (#11306) @vyasr
- removing some unused code (#11305) @hyperbolic2346
- Add test of wildcard selection (#11300) @vyasr
- Update parquet reader to take stream parameter (#11294) @PointKernel
- Spark list hashing (#11292) @bdice
- Remove legacy join APIs (#11274) @vyasr
- Fix `cudf` recipes syntax (#11273) @ajschmidt8
- Fix `cudf` recipe (#11267) @ajschmidt8
- Cleanup config files (#11266) @vyasr
- Run mypy on all packages (#11265) @vyasr
- Update to isort 5.10.1. (#11262) @vyasr
- Consolidate flake8 and pydocstyle configuration (#11260) @vyasr
- Remove redundant black config specifications. (#11258) @vyasr
- Ensure DeprecationWarnings are not introduced via pre-commit (#11255) @wence-
- Optimization to gpu::PreprocessColumnData in parquet reader. (#11252) @nvdbaranec
- Move rolling impl details to detail/ directory. (#11250) @mythrocks
- Remove `lists::drop_list_duplicates` (#11236) @ttnghia
- Use `cudf::lists::distinct` in Python binding (#11234) @ttnghia
- Use `cudf::lists::distinct` in Java binding (#11233) @ttnghia
- Use `cudf::distinct` in Java binding (#11232) @ttnghia
- Pin `dask-cuda` in dev environment (#11229) @galipremsagar
- Remove cruft in map_lookup (#11221) @mythrocks
- Deprecate `skiprows` & `num_rows` in parquet reader (#11218) @galipremsagar
- Remove Frame._index (#11210) @vyasr
- Improve performance for `cudf::contains` when searching for a scalar (#11202) @ttnghia
- Document why Development component is needed for CMake. (#11200) @vyasr
- cleanup unused code in rolling_test.hpp (#11195) @karthikeyann
- Standardize join internals around DataFrame (#11184) @vyasr
- Move character case table declarations from src to detail (#11183) @davidwendt
- Remove usage of Frame in StringMethods (#11181) @vyasr
- Expose get_json_object_options to Python (#11180) @SrikarVanavasam
- Fix decimal128 stats in parquet writer (#11179) @etseidl
- Modify CheckPageRows in parquet_test to use datasources (#11177) @etseidl
- Pin max version of `cuda-python` to `11.7.0` (#11174) @Ethyling
- Refactor and optimize Frame.where (#11168) @vyasr
- Add npos const static member to cudf::string_view (#11166) @davidwendt
- Move _drop_rows_by_label from Frame to IndexedFrame (#11157) @vyasr
- Clean up _copy_type_metadata (#11156) @vyasr
- Add `nvcc` conda package in dev environment (#11154) @galipremsagar
- Struct binary comparison op functionality for spark rapids (#11153) @rwlee
- Refactor inline conditionals. (#11151) @bdice
- Refactor Spark hashing tests (#11145) @bdice
- Add new `_from_data_like_self` factory (#11140) @vyasr
- Update get_cucollections to use rapids-cmake (#11139) @vyasr
- Remove unnecessary extra function for libcudacxx detection (#11138) @vyasr
- Allow initial value for cudf::reduce and cudf::segmented_reduce. (#11137) @SrikarVanavasam
- Remove Index.replace API (#11131) @vyasr
- Move char-type table function declarations from src to detail (#11127) @davidwendt
- Clean up repo root (#11124) @bdice
- Improve print formatting of strings containing newline characters. (#11108) @nvdbaranec
- Fix cudf::string_view::find() to return pos for empty string argument (#11107) @davidwendt
- Forward-merge branch-22.06 to branch-22.08 (#11086) @bdice
- Take iterators by value in clamp.cu. (#11084) @bdice
- Performance improvements for row to column conversions (#11075) @hyperbolic2346
- Remove deprecated Index methods from Frame (#11073) @vyasr
- Use per-page max compressed size estimate for compression (#11066) @devavret
- column to row refactor for performance (#11063) @hyperbolic2346
- Include `skbuild` directory into `build.sh` `clean` operation (#11060) @galipremsagar
- Unpin `dask` & `distributed` for development (#11058) @galipremsagar
- Add support for `Series.between` (#11051) @galipremsagar
- Fix groupby include (#11046) @bwyogatama
- Regex cleanup internal reclass and reclass_device classes (#11045) @davidwendt
- Remove public API of cudf.merge_sorted. (#11032) @bdice
- Drop python `3.7` in code-base (#11029) @galipremsagar
- Addition & integration of the integer power operator (#11025) @AtlantaPepsi
- Refactor `lists::contains` (#11019) @ttnghia
- Change build.sh to find C++ library by default and avoid shadowing CMAKE_ARGS (#11013) @vyasr
- Clean up parquet unit test (#11005) @PointKernel
- Add missing #pragma once to header files (#11004) @karthikeyann
- Cleanup `iterator.cuh` and add fixed point support for `scalar_optional_accessor` (#10999) @ttnghia
- Refactor `cudf::contains` (#10997) @ttnghia
- Remove Arrow CUDA IPC code (#10995) @shwina
- Change file extension for groupby benchmark (#10985) @ttnghia
- Sort recipe include checks. (#10984) @bdice
- Update cuCollections for thrust upgrade (#10983) @PointKernel
- Expose row-group size options in cudf ParquetWriter (#10980) @rjzamora
- Cleanup cudf::strings::detail::regex_parser class source (#10975) @davidwendt
- Handle missing fields as nulls in get_json_object() (#10970) @SrikarVanavasam
- Fix license families to match all-caps expected by conda-verify. (#10931) @bdice
- Include <optional> for GCC 11 compatibility. (#10927) @bdice
- Enable builds with scikit-build (#10919) @vyasr
- Improve `distinct` by using `cuco::static_map::retrieve_all` (#10916) @PointKernel
- update cudfjni to 22.08.0-SNAPSHOT (#10910) @pxLi
- Improve the capture of fatal cuda error (#10884) @sperlingxx
- Cleanup regex compiler operators and operands source (#10879) @davidwendt
- Buffer: make `.ptr` read-only (#10872) @madsbk
- Configurable NaN handling in device_row_comparators (#10870) @rwlee
- Register `cudf.core.groupby.Grouper` objects to dask `grouper_dispatch` (#10838) @brandon-b-miller
- Upgrade to `arrow-8` (#10816) @galipremsagar
- Remove `__getattr__` method in RangeIndex class (#10538) @skirui-source
- Adding bins to value counts (#8247) @marlenezw
- C++
Published by GPUtester over 3 years ago
https://github.com/rapidsai/cudf - v22.08.00
🚨 Breaking Changes
- Remove legacy join APIs (#11274) @vyasr
- Remove `lists::drop_list_duplicates` (#11236) @ttnghia
- Remove Index.replace API (#11131) @vyasr
- Remove deprecated Index methods from Frame (#11073) @vyasr
- Remove public API of cudf.merge_sorted. (#11032) @bdice
- Drop python `3.7` in code-base (#11029) @galipremsagar
- Return empty dataframe when reading a Parquet file using empty `columns` option (#11018) @vuule
- Remove Arrow CUDA IPC code (#10995) @shwina
- Buffer: make `.ptr` read-only (#10872) @madsbk
🐛 Bug Fixes
- Fix `distributed` error related to `loop_in_thread` (#11428) @galipremsagar
- Relax arrow pinning to just 8.x and remove cuda build dependency from cudf recipe (#11412) @kkraus14
- Revert "Allow CuPy 11" (#11409) @jakirkham
- Fix `moto` timeouts (#11369) @galipremsagar
- Set `+/-infinity` as the `identity` values for floating-point numbers in device operators `min` and `max` (#11357) @ttnghia
- Fix memory_usage() for `ListSeries` (#11355) @thomcom
- Fix constructing Column from column_view with expired mask (#11354) @shwina
- Handle parquet corner case: Columns with more rows than are in the row group. (#11353) @nvdbaranec
- Fix `DatetimeIndex` & `TimedeltaIndex` constructors (#11342) @galipremsagar
- Fix unsigned-compare compile warning in IntPow binops (#11339) @davidwendt
- Fix performance issue and add a new code path to `cudf::detail::contains` (#11330) @ttnghia
- Pin `pytorch` to temporarily unblock from `libcupti` errors (#11289) @galipremsagar
- Workaround for nvcomp zstd overwriting blocks for orc due to underestimate of sizes (#11288) @jbrennan333
- Fix inconsistency when hashing two tables in `cudf::detail::contains` (#11284) @ttnghia
- Fix issue related to numpy array and `category` dtype (#11282) @galipremsagar
- Add NotImplementedError when on is specified in DataFrame.join. (#11275) @vyasr
- Fix invalid allocate_like() and empty_like() tests. (#11268) @nvdbaranec
- Returns DataFrame When Concating Along Axis 1 (#11263) @isVoid
- Fix compile error due to missing header (#11257) @ttnghia
- Fix a memory aliasing/crash issue in scatter for lists. (#11254) @nvdbaranec
- Fix `tests/rolling/empty_input_test` (#11238) @ttnghia
- Fix const qualifier when using `host_span<bitmask_type const*>` (#11220) @ttnghia
- Avoid using `nvcompBatchedDeflateDecompressGetTempSizeEx` in cuIO (#11213) @vuule
- Generate benchmark data with correct run length regardless of cardinality (#11205) @vuule
- Fix cumulative count index behavior (#11188) @brandon-b-miller
- Fix assertion in dask_cudf test_struct_explode (#11170) @rjzamora
- Provides a method for the user to remove the hook and re-register the hook in a custom shutdown hook manager (#11161) @res-life
- Fix compatibility issues with pandas 1.4.3 (#11152) @vyasr
- Ensure cuco export set is installed in cmake build (#11147) @jlowe
- Avoid redundant deepcopy in `cudf.from_pandas` (#11142) @galipremsagar
- Fix compile error due to missing header (#11126) @ttnghia
- Fix `__cuda_array_interface__` failures (#11113) @galipremsagar
- Support octal and hex within regex character class pattern (#11112) @davidwendt
- Fix split_re matching logic for word boundaries (#11106) @davidwendt
- Handle multiple files metadata in `read_parquet` (#11105) @galipremsagar
- Fix index alignment for Series objects with repeated index (#11103) @shwina
- FindcuFile now searches in the current CUDA Toolkit location (#11101) @robertmaynard
- Fix regex word boundary logic to include underline (#11099) @davidwendt
- Exclude CudaFatalTest when selecting all Java tests (#11083) @jlowe
- Fix duplicate `cudatoolkit` pinning issue (#11070) @galipremsagar
- Maintain the input index in the result of a groupby-transform (#11068) @shwina
- Fix bug with row count comparison for expect_columns_equivalent(). (#11059) @nvdbaranec
- Fix BPE uninitialized size value for null and empty input strings (#11054) @davidwendt
- Include missing header for usage of `get_current_device_resource()` (#11047) @AtlantaPepsi
- Fix warn_unused_result error in parquet test (#11026) @karthikeyann
- Return empty dataframe when reading a Parquet file using empty `columns` option (#11018) @vuule
- Fix small error in page row count limiting (#10991) @etseidl
- Fix a row index entry error in ORC writer issue (#10989) @vuule
- Fix grouped covariance to require both values to be convertible to double. (#10891) @bdice
📖 Documentation
- Fix issues with day & night modes in python docs (#11400) @galipremsagar
- Update missing data handling APIs in docs (#11345) @galipremsagar
- Add lists filtering APIs to doxygen group. (#11336) @bdice
- Remove unused import in README sample (#11318) @vyasr
- Note null behavior in `where` docs (#11276) @brandon-b-miller
- Update docstring for spans in `get_row_data_range` (#11271) @vyasr
- Update nvCOMP integration table (#11231) @vuule
- Add dev docs for documentation writing (#11217) @vyasr
- Documentation fix for concatenate (#11187) @dagardner-nv
- Fix unresolved links in markdown (#11173) @karthikeyann
- Fix cudf version in README.md install commands (#11164) @jvanstraten
- Switch `language` from `None` to `"en"` in docs build (#11133) @galipremsagar
- Remove docs mentioning scalar_view since no such class exists. (#11132) @bdice
- Add docstring entry for `DataFrame.value_counts` (#11039) @galipremsagar
- Add docs to rolling var, std, count. (#11035) @bdice
- Fix docs for Numba UDFs. (#11020) @bdice
- Replace column comparison utilities functions with macros (#11007) @karthikeyann
- Fix Doxygen warnings in multiple headers files (#11003) @karthikeyann
- Fix doxygen warnings in utilities/ headers (#10974) @karthikeyann
- Fix Doxygen warnings in table header files (#10964) @karthikeyann
- Fix Doxygen warnings in column header files (#10963) @karthikeyann
- Fix Doxygen warnings in strings / header files (#10937) @karthikeyann
- Generate Doxygen Tag File for Libcudf (#10932) @isVoid
- Fix doxygen warnings in structs, lists headers (#10923) @karthikeyann
- Fix doxygen warnings in fixed_point.hpp (#10922) @karthikeyann
- Fix doxygen warnings in ast/, rolling, tdigest/, wrappers/, dictionary/ headers (#10921) @karthikeyann
- fix doxygen warnings in cudf/io/types.hpp, other header files (#10913) @karthikeyann
- fix doxygen warnings in cudf/io/ avro, csv, json, orc, parquet header files (#10912) @karthikeyann
- Fix doxygen warnings in cudf/*.hpp (#10896) @karthikeyann
- Add missing documentation in aggregation.hpp (#10887) @karthikeyann
- Revise PR template. (#10774) @bdice
🚀 New Features
- Change cmake to allow controlling Arrow version via cmake variable (#11429) @kkraus14
- Adding support for list<int8> columns to be written as byte arrays in parquet (#11328) @hyperbolic2346
- Adding byte array view structure (#11322) @hyperbolic2346
- Adding byte_array statistics (#11303) @hyperbolic2346
- Add column indexes to Parquet writer (#11302) @etseidl
- Provide an Option for Default Integer and Floating Bitwidth (#11272) @isVoid
- FST benchmark (#11243) @karthikeyann
- Adds the Finite-State Transducer algorithm (#11242) @elstehle
- Refactor `collect_set` to use `cudf::distinct` and `cudf::lists::distinct` (#11228) @ttnghia
- Treat zstd as stable in nvcomp releases 2.3.2 and later (#11226) @jbrennan333
- Add 24 bit dictionary support to Parquet writer (#11216) @devavret
- Enable positive group indices for extractAllRecord on JNI (#11215) @anthony-chang
- JNI bindings for NTH_ELEMENT window aggregation (#11201) @mythrocks
- Add JNI bindings for extractAllRecord (#11196) @anthony-chang
- Add `cudf.options` (#11193) @isVoid
- Add thrift support for parquet column and offset indexes (#11178) @etseidl
- Adding binary read/write as options for parquet (#11160) @hyperbolic2346
- Support `nth_element` for window functions (#11158) @mythrocks
- Implement `lists::distinct` and `cudf::detail::stable_distinct` (#11149) @ttnghia
- Implement Groupby pct_change (#11144) @skirui-source
- Add JNI for set operations (#11143) @ttnghia
- Remove deprecated PER_THREAD_DEFAULT_STREAM (#11134) @jbrennan333
- Added a Java method to check the existence of a list of keys in a map (#11128) @razajafri
- Feature/python benchmarking (#11125) @vyasr
- Support `nan_equality` in `cudf::distinct` (#11118) @ttnghia
- Added JNI for getMapValueForKeys (#11104) @razajafri
- Refactor `semi_anti_join` (#11100) @ttnghia
- Replace remaining instances of rmm::cuda_stream_default with cudf::default_stream_value (#11082) @jbrennan333
- Adds the Logical Stack algorithm (#11078) @elstehle
- Add doxygen-check pre-commit hook (#11076) @karthikeyann
- Use new nvCOMP API to optimize the decompression temp memory size (#11064) @vuule
- Add Doxygen CI check (#11057) @karthikeyann
- Support `duplicate_keep_option` in `cudf::distinct` (#11052) @ttnghia
- Support set operations (#11043) @ttnghia
- Support for ZLIB compression in ORC writer (#11036) @vuule
- Adding feature swaplevels (#11027) @VamsiTallam95
- Use nvCOMP for ZLIB decompression in ORC reader (#11024) @vuule
- Function for bfill, ffill #9591 (#11022) @Sreekiran096
- Generate group offsets from element labels (#11017) @ttnghia
- Feature axes (#10979) @VamsiTallam95
- Generate group labels from offsets (#10945) @ttnghia
- Add missing cuIO benchmark coverage for duration types (#10933) @vuule
- Dask-cuDF cumulative groupby ops (#10889) @brandon-b-miller
- Reindex Improvements (#10815) @brandon-b-miller
- Implement value_counts for DataFrame (#10813) @martinfalisse
🛠️ Improvements
- Pin `dask` & `distributed` for release (#11433) @galipremsagar
- Use documented header template for `doxygen` (#11430) @galipremsagar
- Relax arrow version in dev env (#11418) @galipremsagar
- Allow CuPy 11 (#11393) @jakirkham
- Improve multibyte_split performance (#11347) @cwharris
- Switch death test to use explicit trap. (#11326) @vyasr
- Add --output-on-failure to ctest args. (#11321) @vyasr
- Consolidate remaining DataFrame/Series APIs (#11315) @vyasr
- Add JNI support for the join_strings API (#11309) @revans2
- Add cupy version to setup.py install_requires (#11306) @vyasr
- removing some unused code (#11305) @hyperbolic2346
- Add test of wildcard selection (#11300) @vyasr
- Update parquet reader to take stream parameter (#11294) @PointKernel
- Spark list hashing (#11292) @bdice
- Remove legacy join APIs (#11274) @vyasr
- Fix `cudf` recipes syntax (#11273) @ajschmidt8
- Fix `cudf` recipe (#11267) @ajschmidt8
- Cleanup config files (#11266) @vyasr
- Run mypy on all packages (#11265) @vyasr
- Update to isort 5.10.1. (#11262) @vyasr
- Consolidate flake8 and pydocstyle configuration (#11260) @vyasr
- Remove redundant black config specifications. (#11258) @vyasr
- Ensure DeprecationWarnings are not introduced via pre-commit (#11255) @wence-
- Optimization to gpu::PreprocessColumnData in parquet reader. (#11252) @nvdbaranec
- Move rolling impl details to detail/ directory. (#11250) @mythrocks
- Remove `lists::drop_list_duplicates` (#11236) @ttnghia
- Use `cudf::lists::distinct` in Python binding (#11234) @ttnghia
- Use `cudf::lists::distinct` in Java binding (#11233) @ttnghia
- Use `cudf::distinct` in Java binding (#11232) @ttnghia
- Pin `dask-cuda` in dev environment (#11229) @galipremsagar
- Remove cruft in map_lookup (#11221) @mythrocks
- Deprecate `skiprows` & `num_rows` in parquet reader (#11218) @galipremsagar
- Remove Frame._index (#11210) @vyasr
- Improve performance for `cudf::contains` when searching for a scalar (#11202) @ttnghia
- Document why Development component is needing for CMake. (#11200) @vyasr
- cleanup unused code in rolling_test.hpp (#11195) @karthikeyann
- Standardize join internals around DataFrame (#11184) @vyasr
- Move character case table declarations from src to detail (#11183) @davidwendt
- Remove usage of Frame in StringMethods (#11181) @vyasr
- Expose get_json_object_options to Python (#11180) @SrikarVanavasam
- Fix decimal128 stats in parquet writer (#11179) @etseidl
- Modify CheckPageRows in parquet_test to use datasources (#11177) @etseidl
- Pin max version of `cuda-python` to `11.7.0` (#11174) @Ethyling
- Refactor and optimize Frame.where (#11168) @vyasr
- Add npos const static member to cudf::string_view (#11166) @davidwendt
- Move _drop_rows_by_label from Frame to IndexedFrame (#11157) @vyasr
- Clean up _copy_type_metadata (#11156) @vyasr
- Add `nvcc` conda package in dev environment (#11154) @galipremsagar
- Struct binary comparison op functionality for spark rapids (#11153) @rwlee
- Refactor inline conditionals. (#11151) @bdice
- Refactor Spark hashing tests (#11145) @bdice
- Add new `_from_data_like_self` factory (#11140) @vyasr
- Update get_cucollections to use rapids-cmake (#11139) @vyasr
- Remove unnecessary extra function for libcudacxx detection (#11138) @vyasr
- Allow initial value for cudf::reduce and cudf::segmented_reduce. (#11137) @SrikarVanavasam
- Remove Index.replace API (#11131) @vyasr
- Move char-type table function declarations from src to detail (#11127) @davidwendt
- Clean up repo root (#11124) @bdice
- Improve print formatting of strings containing newline characters. (#11108) @nvdbaranec
- Fix cudf::string_view::find() to return pos for empty string argument (#11107) @davidwendt
- Forward-merge branch-22.06 to branch-22.08 (#11086) @bdice
- Take iterators by value in clamp.cu. (#11084) @bdice
- Performance improvements for row to column conversions (#11075) @hyperbolic2346
- Remove deprecated Index methods from Frame (#11073) @vyasr
- Use per-page max compressed size estimate for compression (#11066) @devavret
- column to row refactor for performance (#11063) @hyperbolic2346
- Include `skbuild` directory into `build.sh` `clean` operation (#11060) @galipremsagar
- Unpin `dask` & `distributed` for development (#11058) @galipremsagar
- Add support for `Series.between` (#11051) @galipremsagar
- Fix groupby include (#11046) @bwyogatama
- Regex cleanup internal reclass and reclass_device classes (#11045) @davidwendt
- Remove public API of cudf.merge_sorted. (#11032) @bdice
- Drop python `3.7` in code-base (#11029) @galipremsagar
- Addition & integration of the integer power operator (#11025) @AtlantaPepsi
- Refactor `lists::contains` (#11019) @ttnghia
- Change build.sh to find C++ library by default and avoid shadowing CMAKE_ARGS (#11013) @vyasr
- Clean up parquet unit test (#11005) @PointKernel
- Add missing #pragma once to header files (#11004) @karthikeyann
- Cleanup `iterator.cuh` and add fixed point support for `scalar_optional_accessor` (#10999) @ttnghia
- Refactor `cudf::contains` (#10997) @ttnghia
- Remove Arrow CUDA IPC code (#10995) @shwina
- Change file extension for groupby benchmark (#10985) @ttnghia
- Sort recipe include checks. (#10984) @bdice
- Update cuCollections for thrust upgrade (#10983) @PointKernel
- Expose row-group size options in cudf ParquetWriter (#10980) @rjzamora
- Cleanup cudf::strings::detail::regex_parser class source (#10975) @davidwendt
- Handle missing fields as nulls in get_json_object() (#10970) @SrikarVanavasam
- Fix license families to match all-caps expected by conda-verify. (#10931) @bdice
- Include <optional> for GCC 11 compatibility. (#10927) @bdice
- Enable builds with scikit-build (#10919) @vyasr
- Improve `distinct` by using `cuco::static_map::retrieve_all` (#10916) @PointKernel
- update cudfjni to 22.08.0-SNAPSHOT (#10910) @pxLi
- Improve the capture of fatal cuda error (#10884) @sperlingxx
- Cleanup regex compiler operators and operands source (#10879) @davidwendt
- Buffer: make `.ptr` read-only (#10872) @madsbk
- Configurable NaN handling in device_row_comparators (#10870) @rwlee
- Register `cudf.core.groupby.Grouper` objects to dask `grouper_dispatch` (#10838) @brandon-b-miller
- Upgrade to `arrow-8` (#10816) @galipremsagar
- Remove `__getattr__` method in RangeIndex class (#10538) @skirui-source
- Adding bins to value counts (#8247) @marlenezw
- C++
Published by GPUtester over 3 years ago
https://github.com/rapidsai/cudf - v22.06.01
v22.06.01
- C++
Published by GPUtester over 3 years ago
https://github.com/rapidsai/cudf - v22.06.00
🚨 Breaking Changes
- Enable Zstandard decompression only when all nvcomp integrations are enabled (#10944) @vuule
- Rename `sliced_child` to `get_sliced_child`. (#10885) @bdice
- Add parameters to control page size in Parquet writer (#10882) @etseidl
- Make cudf::test::expect_columns_equal() to fail when comparing unsanitary lists. (#10880) @nvdbaranec
- Cleanup regex compiler fixed quantifiers source (#10843) @davidwendt
- Refactor `cudf::contains`, renaming and switching parameters role (#10802) @ttnghia
- Generic serialization of all column types (#10784) @wence-
- Return per-file metadata from readers (#10782) @vuule
- HostColumnVectoreCore#isNull should return true for out-of-range rows (#10779) @gerashegalov
- Update `groupby::hash` to use new row operators for keys (#10770) @PointKernel
- update mangle_dupe_cols behavior in csv reader to match pandas 1.4.0 behavior (#10749) @karthikeyann
- Rename CUDA_TRY macro to CUDF_CUDA_TRY, rename CHECK_CUDA macro to CUDF_CHECK_CUDA. (#10589) @bdice
- Upgrade `cudf` to support `pandas` 1.4.x versions (#10584) @galipremsagar
- Move binop methods from Frame to IndexedFrame and standardize the docstring (#10576) @vyasr
- Add default= kwarg to .list.get() accessor method (#10547) @shwina
- Remove deprecated `decimal_cols_as_float` in the ORC reader (#10515) @vuule
- Support nvComp 2.3 if local, otherwise use nvcomp 2.2 (#10513) @robertmaynard
- Fix findall_record to return empty list for no matches (#10491) @davidwendt
- Namespace/Docstring Fixes for Reduction (#10471) @isVoid
- Additional refactoring of hash functions (#10462) @bdice
- Fix default value of str.split expand parameter. (#10457) @bdice
- Remove deprecated code. (#10450) @vyasr
🐛 Bug Fixes
- Fix single column `MultiIndex` issue in `sort_index` (#10957) @galipremsagar
- Make SerializedTableHeader(numRows) public (#10949) @gerashegalov
- Fix `gcc_linux` version pinning in dev environment (#10943) @galipremsagar
- Fix an issue with reading raw string in `cudf.read_json` (#10924) @galipremsagar
- Make cudf::test::expect_columns_equal() to fail when comparing unsanitary lists. (#10880) @nvdbaranec
- Fix segmented_reduce on empty column with non-empty offsets (#10876) @davidwendt
- Fix dask-cudf groupby handling when grouping by all columns (#10866) @charlesbluca
- Fix a bug in `distinct`: using nested nulls logic (#10848) @PointKernel
- Fix constness / references in weak ordering operator() signatures. (#10846) @bdice
- Suppress sizeof-array-div warnings in thrust found by gcc-11 (#10840) @robertmaynard
- Add handling for string by-columns in dask-cudf groupby (#10830) @charlesbluca
- Fix compile warning in search.cu (#10827) @davidwendt
- Fix element access const correctness in `hostdevice_vector` (#10804) @vuule
- Update `cuco` git tag (#10788) @PointKernel
- HostColumnVectoreCore#isNull should return true for out-of-range rows (#10779) @gerashegalov
- Fixing deprecation warnings in test_orc.py (#10772) @hyperbolic2346
- Enable writing to `s3` storage in chunked parquet writer (#10769) @galipremsagar
- Fix construction of nested structs with EMPTY child (#10761) @shwina
- Fix replace error when regex has only zero match quantifiers (#10760) @davidwendt
- Fix an issue with one_level_list schemas in parquet reader. (#10750) @nvdbaranec
- update mangle_dupe_cols behavior in csv reader to match pandas 1.4.0 behavior (#10749) @karthikeyann
- Fix `cupy` function in notebook (#10737) @ajschmidt8
- Fix `fillna` to retain `columns` when it is `MultiIndex` (#10729) @galipremsagar
- Fix scatter for all-empty-string column case (#10724) @davidwendt
- Retain series name in `Series.apply` (#10716) @brandon-b-miller
- Correct build dir `cudf-config` dependency issues for static builds (#10704) @robertmaynard
- Fix list of testing requirements in setup.py. (#10678) @bdice
- Fix rounding to zero error in stod on very small float numbers (#10672) @davidwendt
- cuco isn't a cudf dependency when we are built shared (#10662) @robertmaynard
- Fix to_timestamps to support Z for %z format specifier (#10617) @davidwendt
- Verify compression type in Parquet reader (#10610) @vuule
- Fix struct row comparator's exception on empty structs (#10604) @sperlingxx
- Fix strings strip() to accept only str Scalar for to_strip parameter (#10597) @davidwendt
- Fix has_atomic_support check in can_use_hash_groupby() (#10588) @jbrennan333
- Revert Thrust 1.16 to Thrust 1.15 (#10586) @bdice
- Fix missing RMM_STATIC_CUDART define when compiling JNI with static CUDA runtime (#10585) @jlowe
- pin more cmake versions (#10570) @robertmaynard
- Re-enable Build Metrics Report (#10562) @davidwendt
- Remove statically linked CUDA runtime check in Java build (#10532) @jlowe
- Fix temp data cleanup in `test_text.py` (#10524) @brandon-b-miller
- Update pre-commit to run black 22.3.0 (#10523) @vyasr
- Remove deprecated `decimal_cols_as_float` in the ORC reader (#10515) @vuule
- Fix findall_record to return empty list for no matches (#10491) @davidwendt
- Allow users to specify data types for a subset of columns in `read_csv` (#10484) @vuule
- Fix default value of str.split expand parameter. (#10457) @bdice
- Improve coverage of dask-cudf's groupby aggregation, add tests for `dropna` support (#10449) @charlesbluca
- Allow string aggs for `dask_cudf.CudfDataFrameGroupBy.aggregate` (#10222) @charlesbluca
- In-place updates with loc or iloc don't work correctly when the LHS has more than one column (#9918) @skirui-source
📖 Documentation
- Clarify append deprecation notice. (#10930) @bdice
- Use full name of GPUDirect Storage SDK in docs (#10904) @vuule
- Update Dask + Pandas to Dask + cuDF path (#10897) @miguelusque
- Add missing documentation in cudf/types.hpp (#10895) @karthikeyann
- Add strong index iterator docs. (#10888) @bdice
- spell check fixes (#10865) @karthikeyann
- Add missing documentation in scalar/ headers (#10861) @karthikeyann
- Remove typo in ngram documentation (#10859) @miguelusque
- fix doxygen warnings (#10842) @karthikeyann
- Add a library_design.md file documenting the core Python data structures and their relationship (#10817) @vyasr
- Add NumPy to intersphinx references. (#10809) @bdice
- Add a section to the docs that compares cuDF with Pandas (#10796) @shwina
- Mention 2 cpp-reviewer requirement in pull request template (#10768) @davidwendt
- Enable pydocstyle for all packages. (#10759) @bdice
- Enable pydocstyle rules involving quotes (#10748) @vyasr
- Revise 10 minutes notebook. (#10738) @bdice
- Reorganize cuDF Python docs (#10691) @shwina
- Fix sphinx/jupyter heading issue in UDF notebook (#10690) @brandon-b-miller
- Migrated user guide notebooks to MyST-NB and added sphinx extension (#10685) @mmccarty
- add data generation to benchmark documentation (#10677) @karthikeyann
- Fix some docs build warnings (#10674) @galipremsagar
- Update UDF notebook in User Guide. (#10668) @bdice
- Improve User Guide docs (#10663) @bdice
- Fix some docstrings formatting (#10660) @galipremsagar
- Remove implementation details from `apply` docstrings (#10651) @brandon-b-miller
- Revise CONTRIBUTING.md (#10644) @bdice
- Add missing APIs to documentation. (#10643) @bdice
- Use cudf.read_json as documented API name. (#10640) @bdice
- Fix docstring section headings. (#10639) @bdice
- Document cudf.read_text and cudf.read_avro. (#10638) @bdice
- Fix type-o in docstring for json_reader_options (#10627) @dagardner-nv
- Update guide to UDFs with notes about `Series.applymap` deprecation and related changes (#10607) @brandon-b-miller
- Fix doxygen Modules page for cudf::lists::sequences (#10561) @davidwendt
- Add Replace Backreferences section to Regex Features page (#10560) @davidwendt
- Introduce deprecation policy to developer guide. (#10252) @vyasr
🚀 New Features
- Enable Zstandard decompression only when all nvcomp integrations are enabled (#10944) @vuule
- Handle nested types in cudf::concatenate_rows() (#10890) @nvdbaranec
- Strong index types for equality comparator (#10883) @ttnghia
- Add parameters to control page size in Parquet writer (#10882) @etseidl
- Support for Zstandard decompression in ORC reader (#10873) @vuule
- Use pre-built nvcomp 2.3 binaries by default (#10851) @robertmaynard
- Support for Zstandard decompression in Parquet reader (#10847) @vuule
- Add JNI support for apply_boolean_mask (#10812) @res-life
- Segmented Min/Max for Fixed Point Types (#10794) @isVoid
- Return per-file metadata from readers (#10782) @vuule
- Segmented `apply_boolean_mask` for `LIST` columns (#10773) @mythrocks
- Update `groupby::hash` to use new row operators for keys (#10770) @PointKernel
- Support purging non-empty null elements from LIST/STRING columns (#10701) @mythrocks
- Add `detail::hash_join` (#10695) @PointKernel
- Persist string statistics data across multiple calls to orc chunked write (#10694) @hyperbolic2346
- Add `.list.astype()` to cast list leaves to specified dtype (#10693) @shwina
- JNI: Add generateListOffsets API (#10683) @sperlingxx
- Support `args` in groupby apply (#10682) @brandon-b-miller
- Enable segmented_gather in Java package (#10669) @sperlingxx
- Add row hasher with nested column support (#10641) @devavret
- Add support for numeric_only in DataFrame.reduce (#10629) @martinfalisse
- First step toward statistics in ORC files with chunked writes (#10567) @hyperbolic2346
- Add support for struct columns to the random table generator (#10566) @vuule
- Enable passing a sequence for the `index` argument to `.list.get()` (#10564) @shwina
- Add python bindings for cudf::list::index_of (#10549) @ChrisJar
- Add default= kwarg to .list.get() accessor method (#10547) @shwina
- Add `cudf.DataFrame.applymap` (#10542) @brandon-b-miller
- Support nvComp 2.3 if local, otherwise use nvcomp 2.2 (#10513) @robertmaynard
- Add column field ID control in parquet writer (#10504) @PointKernel
- Deprecate `Series.applymap` (#10497) @brandon-b-miller
- Add option to drop cache in cuIO benchmarks (#10488) @vuule
- move benchmark input generation in device in reduction nvbench (#10486) @karthikeyann
- Support Segmented Min/Max Reduction on String Type (#10447) @isVoid
- List element Equality comparator (#10289) @devavret
- Implement all methods of groupby rank aggregation in libcudf, python (#9569) @karthikeyann
- Implement DataFrame.eval using libcudf ASTs (#8022) @vyasr
🛠️ Improvements
- Use `conda` compilers in env file (#10915) @galipremsagar
- Remove C style artifacts in cuIO (#10886) @vuule
- Rename `sliced_child` to `get_sliced_child`. (#10885) @bdice
- Replace defaulted stream value for libcudf APIs that use NVCOMP (#10877) @jbrennan333
- Add more unit tests for `cudf::distinct` for nested types with sliced input (#10860) @ttnghia
- Changing `list_view.cuh` to `list_view.hpp` (#10854) @ttnghia
- More error checking in `from_dlpack` (#10850) @wence-
- Cleanup regex compiler fixed quantifiers source (#10843) @davidwendt
- Adds the JNI call for Cuda.deviceSynchronize (#10839) @abellina
- Add missing cuda-python dependency to cudf (#10833) @bdice
- Change std::string parameters in cudf::strings APIs to std::string_view (#10832) @davidwendt
- Split up search.cu to improve compile time (#10831) @davidwendt
- Add tests for null scalar binaryops (#10828) @brandon-b-miller
- Cleanup regex compile optimize functions (#10825) @davidwendt
- Use `ThreadedMotoServer` instead of `subprocess` in spinning up `s3` server (#10822) @galipremsagar
- Import `NA` from `missing` rather than using `cudf.NA` everywhere (#10821) @brandon-b-miller
- Refactor regex builtin character-class identifiers (#10814) @davidwendt
- Change pattern parameter for regex APIs from std::string to std::string_view (#10810) @davidwendt
- Make the JNI API to get list offsets as a view public. (#10807) @revans2
- Add cudf JNI docker build github action (#10806) @pxLi
- Removed `mr` parameter from inplace bitmask operations (#10805) @AtlantaPepsi
- Refactor `cudf::contains`, renaming and switching parameters role (#10802) @ttnghia
- Handle closed property in IntervalDtype.from_pandas (#10798) @wence-
- Return weak orderings from `device_row_comparator`. (#10793) @rwlee
- Rework `Scalar` imports (#10791) @brandon-b-miller
- Enable ccache for cudfjni build in Docker (#10790) @gerashegalov
- Generic serialization of all column types (#10784) @wence-
- simplifying skiprows test in test_orc.py (#10783) @hyperbolic2346
- Use column_views instead of column_device_views in binary operations. (#10780) @bdice
- Add struct utility functions. (#10776) @bdice
- Add multiple rows to subword tokenizer benchmark (#10767) @davidwendt
- Refactor host decompression in ORC reader (#10764) @vuule
- Flush output streams before creating a process to drop caches (#10762) @vuule
- Refactor binaryop/compiled/util.cpp (#10756) @bdice
- Use warp per string for long strings in cudf::strings::contains() (#10739) @davidwendt
- Use generator expressions in any/all functions. (#10736) @bdice
- Use canonical "magic methods" (replace `x.__repr__()` with `repr(x)`). (#10735) @bdice
- Improve use of isinstance. (#10734) @bdice
- Rename tests from multiIndex to multiindex. (#10732) @bdice
- Two-table comparators with strong index types (#10730) @bdice
- Replace std::make_pair with std::pair (C++17 CTAD) (#10727) @karthikeyann
- Use structured bindings instead of std::tie (#10726) @karthikeyann
- Missing `f` prefix on f-strings fix (#10721) @code-review-doctor
- Add `max_file_size` parameter to chunked parquet dataset writer (#10718) @galipremsagar
- Deprecate `merge_sorted`, change dask cudf usage to internal method (#10713) @isVoid
- Prepare dask_cudf test_parquet.py for upcoming API changes (#10709) @rjzamora
- Remove or simplify various utility functions (#10705) @vyasr
- Allow building arrow with parquet and not python (#10702) @revans2
- Partial cuIO GPU decompression refactor (#10699) @vuule
- Cython API refactor: `merge.pyx` (#10698) @isVoid
- Fix random string data length to become variable (#10697) @galipremsagar
- Add bindings for index_of with column search key (#10696) @ChrisJar
- Deprecate index merging (#10689) @vyasr
- Remove cudf::strings::string namespace (#10684) @davidwendt
- Standardize imports. (#10680) @bdice
- Standardize usage of collections.abc. (#10679) @bdice
- Cython API Refactor: `transpose.pyx`, `sort.pyx` (#10675) @isVoid
- Add device_memory_resource parameter to create_string_vector_from_column (#10673) @davidwendt
- Split up mixed-join kernels source files (#10671) @davidwendt
- Use `std::filesystem` for temporary directory location and deletion (#10664) @vuule
- cleanup benchmark includes (#10661) @karthikeyann
- Use upstream clang-format pre-commit hook. (#10659) @bdice
- Clean up C++ includes to use <> instead of "". (#10658) @bdice
- Handle RuntimeError thrown by CUDA Python in `validate_setup` (#10653) @shwina
- Rework JNI CMake to leverage rapids_find_package (#10649) @jlowe
- Use conda to build python packages during GPU tests (#10648) @Ethyling
- Deprecate various functions that don't need to be defined for Index. (#10647) @vyasr
- Update pinning to allow newer CMake versions. (#10646) @vyasr
- Bump hadoop-common from 3.1.4 to 3.2.3 in /java (#10645) @dependabot[bot]
- Remove `concurrent_unordered_multimap`. (#10642) @bdice
- Improve parquet dictionary encoding (#10635) @PointKernel
- Improve cudf::cuda_error (#10630) @sperlingxx
- Add support for null and non-numeric types in Series.diff and DataFrame.diff (#10625) @Matt711
- Branch 22.06 merge 22.04 (#10624) @vyasr
- Unpin `dask` & `distributed` for development (#10623) @galipremsagar
- Slightly improve accuracy of stod in to_floats (#10622) @davidwendt
- Allow libcudfjni to be built as a static library (#10619) @jlowe
- Change stack-based regex state data to use global memory (#10600) @davidwendt
- Resolve Forward merging of `branch-22.04` into `branch-22.06` (#10598) @galipremsagar
- KvikIO as an alternative GDS backend (#10593) @madsbk
- Rename CUDA_TRY macro to CUDF_CUDA_TRY, rename CHECK_CUDA macro to CUDF_CHECK_CUDA. (#10589) @bdice
- Upgrade `cudf` to support `pandas` 1.4.x versions (#10584) @galipremsagar
- Refactor binary ops for timedelta and datetime columns (#10581) @vyasr
- Refactor cudf::strings::count_re API to use count_matches utility (#10580) @davidwendt
- Update `Programming Language :: Python` Versions to 3.8 & 3.9 (#10579) @madsbk
- Automate Java cudf jar build with statically linked dependencies (#10578) @gerashegalov
- Add patch for thrust-cub 1.16 to fix sort compile times (#10577) @davidwendt
- Move binop methods from Frame to IndexedFrame and standardize the docstring (#10576) @vyasr
- Cleanup libcudf strings regex classes (#10573) @davidwendt
- Simplify preprocessing of arguments for DataFrame binops (#10563) @vyasr
- Reduce kernel calls to build strings findall results (#10559) @davidwendt
- Forward-merge branch-22.04 to branch-22.06 (#10557) @bdice
- Update strings contains benchmark to measure varying match rates (#10555) @davidwendt
- JNI: throw CUDA errors more specifically (#10551) @sperlingxx
- Enable building static libs (#10545) @trxcllnt
- Remove pip requirements files. (#10543) @bdice
- Remove Click pinnings that are unnecessary after upgrading black. (#10541) @vyasr
- Refactor `memory_usage` to improve performance (#10537) @galipremsagar
- Adjust the valid range of group index for replace_with_backrefs (#10530) @sperlingxx
- add accidentally removed comment. (#10526) @vyasr
- Update conda environment. (#10525) @vyasr
- Remove ColumnBase.__getitem__ (#10516) @vyasr
- Optimize `left_semi_join` by materializing the gather mask (#10511) @cheinger
- Define proper binary operation APIs for columns (#10509) @vyasr
- Upgrade `arrow-cpp` & `pyarrow` to `7.0.0` (#10503) @galipremsagar
- Update to Thrust 1.16 (#10489) @bdice
- Namespace/Docstring Fixes for Reduction (#10471) @isVoid
- Update cudfjni 22.06.0-SNAPSHOT (#10467) @pxLi
- Use Lists of Columns for Various Files (#10463) @isVoid
- Additional refactoring of hash functions (#10462) @bdice
- Fix Series.str.findall behavior for expand=False. (#10459) @bdice
- Remove deprecated code. (#10450) @vyasr
- Update cmake-format version. (#10440) @vyasr
- Consolidate C++ `conda` recipes and add `libcudf-tests` package (#10326) @ajschmidt8
- Use conda compilers (#10275) @Ethyling
- Add row bitmask as a `detail::hash_join` member (#10248) @PointKernel
Published by GPUtester over 3 years ago
https://github.com/rapidsai/cudf - v22.04.00
🚨 Breaking Changes
- Drop unsupported method argument from nunique and distinct_count. (#10411) @bdice
- Refactor stream compaction APIs (#10370) @PointKernel
- Add scan_aggregation and reduce_aggregation derived types. (#10357) @nvdbaranec
- Avoid `decimal` type narrowing for decimal binops (#10299) @galipremsagar
- Rewrites `sample` API (#10262) @isVoid
- Remove probe-time null equality parameters in `cudf::hash_join` (#10260) @PointKernel
- Enable proper `Index` round-tripping in `orc` reader and writer (#10170) @galipremsagar
- Add JNI for `strings::split_re` and `strings::split_record_re` (#10139) @ttnghia
- Change cudf::strings::find_multiple to return a lists column (#10134) @davidwendt
- Remove the option to completely disable decimal128 columns in the ORC reader (#10127) @vuule
- Remove deprecated code (#10124) @vyasr
- Update gpu_utils.py to reflect current CUDA support. (#10113) @bdice
- Optimize compaction operations (#10030) @PointKernel
- Remove deprecated method Series.set_index. (#9945) @bdice
- Add cudf::strings::findall_record API (#9911) @davidwendt
- Upgrade `arrow` & `pyarrow` to `6.0.1` (#9686) @galipremsagar
🐛 Bug Fixes
- Fix an issue with tdigest merge aggregations. (#10506) @nvdbaranec
- Batch of fixes for index overflows in grid stride loops. (#10448) @nvdbaranec
- Update dask_cudf imports to be compatible with latest dask (#10442) @rlratzel
- Fix for integer overflow in contiguous-split (#10437) @jbrennan333
- Fix has_null predicate for drop_list_duplicates on nested structs (#10436) @sperlingxx
- Fix empty reduce with List output and non-List input (#10435) @sperlingxx
- Fix `list` and `struct` meta generation issue in `dask-cudf` (#10434) @galipremsagar
- Fix error in `cudf.to_numeric` when a `bool` input is passed (#10431) @galipremsagar
- Support cupy array in `quantile` input (#10429) @galipremsagar
- Fix benchmarks to work with new aggregation types (#10428) @davidwendt
- Fix cudf::shift to handle offset greater than column size (#10414) @davidwendt
- Fix lifespan of the temporary directory that holds cuFile configuration file (#10403) @vuule
- Fix error thrown in compiled-binaryop benchmark (#10398) @davidwendt
- Limiting async allocator using alignment of 512 (#10395) @rongou
- Include <optional> in multibyte split. (#10385) @bdice
- Fix issue with column and scalar re-assignment (#10377) @galipremsagar
- Fix floating point data generation in benchmarks (#10372) @vuule
- Avoid overflow in fused_concatenate_kernel output_index (#10344) @abellina
- Remove is_relationally_comparable for table device views (#10342) @davidwendt
- Fix debug compile error in device_span to column_view conversion (#10331) @davidwendt
- Add Pascal support to JCUDF transcode (row_conversion) (#10329) @mythrocks
- Fix `std::bad_alloc` exception due to JIT reserving a huge buffer (#10317) @ttnghia
- Fixes up the overflowed fixed-point round on nullable column (#10316) @sperlingxx
- Fix DataFrame slicing issues for empty cases (#10310) @brandon-b-miller
- Fix documentation issues (#10307) @ajschmidt8
- Allow Java bindings to use default decimal precisions when writing columns (#10276) @sperlingxx
- Fix incorrect slicing of GDS read/write calls (#10274) @vuule
- Fix out-of-memory error in compiled-binaryop benchmark (#10269) @davidwendt
- Add tests of reflected ufuncs and fix behavior of logical reflected ufuncs (#10261) @vyasr
- Remove probe-time null equality parameters in `cudf::hash_join` (#10260) @PointKernel
- Fix out-of-memory error in UrlDecode benchmark (#10258) @davidwendt
- Fix groupby reductions that perform operations on source type instead of target type (#10250) @ttnghia
- Fix small leak in explode (#10245) @revans2
- Yet another small JNI memory leak (#10238) @revans2
- Fix regex octal parsing to limit to 3 characters (#10233) @davidwendt
- Fix string to decimal128 conversion handling large exponents (#10231) @davidwendt
- Fix JNI leak on copy to device (#10229) @revans2
- Fix the data generator element size for decimal types (#10225) @vuule
- Fix `decimal` metadata in parquet writer (#10224) @galipremsagar
- Fix strings handling of hex in regex pattern (#10220) @davidwendt
- Fix docs builds (#10216) @ajschmidt8
- Fix a leftover _has_nulls change from Nullate (#10211) @devavret
- Fix bitmask of the output for JNI of `lists::drop_list_duplicates` (#10210) @ttnghia
- Fix compile error in `binaryop/compiled/util.cpp` (#10209) @ttnghia
- Skip ORC and Parquet readers' benchmark cases that are not currently supported (#10194) @vuule
- Fix JNI leak of a cudf::column_view native class. (#10171) @revans2
- Enable proper `Index` round-tripping in `orc` reader and writer (#10170) @galipremsagar
- Convert Column Name to String Before Using Struct Column Factory (#10156) @isVoid
- Preserve the correct `ListDtype` while creating an identical empty column (#10151) @galipremsagar
- benchmark fixture - static object pointer fix (#10145) @karthikeyann
- Fix UDF Caching (#10133) @brandon-b-miller
- Raise duplicate column error in `DataFrame.rename` (#10120) @galipremsagar
- Fix flaky memory usage test by guaranteeing array size. (#10114) @vyasr
- Encode values from python callback for C++ (#10103) @jdye64
- Add check for regex instructions causing an infinite-loop (#10095) @davidwendt
- Remove metadata singleton from nvtext normalizer (#10090) @davidwendt
- Column equality testing fixes (#10011) @brandon-b-miller
- Pin libcudf runtime dependency for cudf / libcudf-kafka nightlies (#9847) @charlesbluca
📖 Documentation
- Fix documentation for DataFrame.corr and Series.corr. (#10493) @bdice
- Add `cut` to API docs (#10479) @shwina
- Remove documentation for methods removed in #10124. (#10366) @bdice
- Fix documentation issues (#10306) @ajschmidt8
- Fix `fixed_point` binary operation documentation (#10198) @codereport
- Remove cleaned up methods from docs (#10189) @galipremsagar
- Update developer guide to recommend no default stream parameter. (#10136) @bdice
- Update benchmarking guide to use NVBench. (#10093) @bdice
🚀 New Features
- Add StringIO support to read_text (#10465) @cwharris
- Add support for tdigest and merge_tdigest aggregations through cudf::reduce (#10433) @nvdbaranec
- JNI support for Collect Ops in Reduction (#10427) @sperlingxx
- Enable read_text with dask_cudf using byte_range (#10407) @ChrisJar
- Add `cudf::stable_sort_by_key` (#10387) @PointKernel
- Implement `maps_column_view` abstraction over `LIST<STRUCT<K,V>>` (#10380) @mythrocks
- Support Java bindings for Avro reader (#10373) @HaoYang670
- Refactor stream compaction APIs (#10370) @PointKernel
- Support collect aggregations in reduction (#10353) @sperlingxx
- Refactor __array_ufunc__ for Index and unify across all classes (#10346) @vyasr
- Add JNI for extract_list_element with index column (#10341) @firestarman
- Support `min` and `max` operations for structs in rolling window (#10332) @ttnghia
- Add device create_sequence_table for benchmarks (#10300) @karthikeyann
- Enable numpy ufuncs for DataFrame (#10287) @vyasr
- move input generation for json benchmark to device (#10281) @karthikeyann
- move input generation for type dispatcher benchmark to device (#10280) @karthikeyann
- move input generation for copy benchmark to device (#10279) @karthikeyann
- generate url decode benchmark input in device (#10278) @karthikeyann
- device input generation in join bench (#10277) @karthikeyann
- Add nvtext::byte_pair_encoding API (#10270) @davidwendt
- Prevent internal usage of expensive APIs (#10263) @vyasr
- Column to JCUDF row for tables with strings (#10235) @hyperbolic2346
- Support `percent_rank()` aggregation (#10227) @mythrocks
- Refactor Series.__array_ufunc__ (#10217) @vyasr
- Reduce pytest runtime (#10203) @brandon-b-miller
- Add regex flags parameter to python cudf strings split (#10185) @davidwendt
- Support for `MOD`, `PMOD` and `PYMOD` for `decimal32/64/128` (#10179) @codereport
- Adding string row size iterator for row to column and column to row conversion (#10157) @hyperbolic2346
- Add file size counter to cuIO benchmarks (#10154) @vuule
- byte_range support for multibyte_split/read_text (#10150) @cwharris
- Add JNI for `strings::split_re` and `strings::split_record_re` (#10139) @ttnghia
- Add `maxSplit` parameter to Java binding for `strings:split` (#10137) @ttnghia
- Add libcudf strings split API that accepts regex pattern (#10128) @davidwendt
- generate benchmark input in device (#10109) @karthikeyann
- Avoid `nan_as_null` op if `nan_count` is 0 (#10082) @galipremsagar
- Add Dataframe and Index nunique (#10077) @martinfalisse
- Support nanosecond timestamps in parquet (#10063) @PointKernel
- Java bindings for mixed semi and anti joins (#10040) @jlowe
- Implement mixed equality/conditional semi/anti joins (#10037) @vyasr
- Optimize compaction operations (#10030) @PointKernel
- Support `args=` in `Series.apply` (#9982) @brandon-b-miller
- Add cudf::strings::findall_record API (#9911) @davidwendt
- Add covariance for sort groupby (python) (#9889) @mayankanand007
- Implement DataFrame diff() (#9817) @skirui-source
- Implement DataFrame pct_change (#9805) @skirui-source
- Support segmented reductions and null mask reductions (#9621) @isVoid
- Add 'spearman' correlation method for `dataframe.corr` and `series.corr` (#7141) @dominicshanshan
🛠️ Improvements
- Add `scipy` skip for a test (#10502) @galipremsagar
- Temporarily disable new `ops-bot` functionality (#10496) @ajschmidt8
- Include <cstddef> to fix compilation of parquet reader on GCC 11. (#10483) @bdice
- Pin `dask` and `distributed` (#10481) @galipremsagar
- MD5 refactoring. (#10445) @bdice
- Remove or split up Frame methods that use the index (#10439) @vyasr
- Centralization of tdigest aggregation code. (#10422) @nvdbaranec
- Simplify column binary operations (#10421) @vyasr
- Add `.github/ops-bot.yaml` config file (#10420) @ajschmidt8
- Use list of columns for methods in `Groupby.pyx` (#10419) @isVoid
- Remove warnings in `test_timedelta.py` (#10418) @galipremsagar
- Fix some warnings in `test_parquet.py` (#10416) @galipremsagar
- JNI support for segmented reduce (#10413) @revans2
- Clean up null mask after purging null entries (#10412) @sperlingxx
- Drop unsupported method argument from nunique and distinct_count. (#10411) @bdice
- Use str instead of builtins.str. (#10410) @bdice
- Fix warnings in `test_rolling` (#10405) @bdice
- Enable `codecov` github-check in CI (#10404) @galipremsagar
- Fix warnings in test_cuda_apply, test_numerical, test_pickling, test_unaops. (#10402) @bdice
- Set column names in `_from_columns_like_self` factory (#10400) @isVoid
- Refactor `nvtx` annotations in `cudf` & `dask-cudf` (#10396) @galipremsagar
- Consolidate .cov and .corr for sort groupby (#10386) @skirui-source
- Consolidate some Frame APIs (#10381) @vyasr
- Refactor hash functions and `hash_combine` (#10379) @bdice
- Add `nvtx` annotations for `Series` and `Index` (#10374) @galipremsagar
- Refactor `filling.repeat` API (#10371) @isVoid
- Move standalone UTF8 functions from string_view.hpp to utf8.hpp (#10369) @davidwendt
- Remove doc for deprecated function `one_hot_encoding` (#10367) @isVoid
- Refactor array function (#10364) @vyasr
- Fix warnings in test_csv.py. (#10362) @bdice
- Implement a mixin for binops (#10360) @vyasr
- Refactor cython interface: `copying.pyx` (#10359) @isVoid
- Implement a mixin for scans (#10358) @vyasr
- Add scan_aggregation and reduce_aggregation derived types. (#10357) @nvdbaranec
- Add cleanup of python artifacts (#10355) @galipremsagar
- Fix warnings in test_categorical.py. (#10354) @bdice
- Create a dispatcher for invoking regex kernel functions (#10349) @davidwendt
- Fix `codecov` in CI (#10347) @galipremsagar
- Enable caching for `memory_usage` calculation in `Column` (#10345) @galipremsagar
- C++17 cleanup: traits replace std::enable_if<>::type with std::enable_if_t (#10343) @karthikeyann
- JNI: Support appending DECIMAL128 into ColumnBuilder in terms of byte array (#10338) @sperlingxx
- multibyte_split test improvements (#10328) @vuule
- Fix warnings in test_binops.py. (#10327) @bdice
- Fix warnings from pandas in test_array_ufunc.py. (#10324) @bdice
- Update upload script (#10321) @ajschmidt8
- Move hash type declarations to hashing.hpp (#10320) @davidwendt
- C++17 cleanup: traits replace `::value` with `_v` (#10319) @karthikeyann
- Remove internal columns usage (#10315) @vyasr
- Remove extraneous `build.sh` parameter (#10313) @ajschmidt8
- Add const qualifier to MurmurHash3_32::hash_combine (#10311) @davidwendt
- Remove `TODO` in `libcudf_kafka` recipe (#10309) @ajschmidt8
- Add conversions between column_view and device_span<T const>. (#10302) @bdice
- Avoid `decimal` type narrowing for decimal binops (#10299) @galipremsagar
- Deprecate `DataFrame.iteritems` and introduce `.items` (#10298) @galipremsagar
- Explicitly request CMake use `gnu++17` over `c++17` (#10297) @robertmaynard
- Add copyright check as pre-commit hook. (#10290) @vyasr
- DataFrame `insert` and creation optimizations (#10285) @galipremsagar
- Improve hash join detail functions (#10273) @PointKernel
- Replace custom `cached_property` implementation with functools (#10272) @shwina
- Rewrites `sample` API (#10262) @isVoid
- Bump hadoop-common from 3.1.0 to 3.1.4 in /java (#10259) @dependabot[bot]
- Remove making redundant `copy` across code-base (#10257) @galipremsagar
- Add more `nvtx` annotations (#10256) @galipremsagar
- Add `copyright` check in `cudf` (#10253) @galipremsagar
- Remove redundant copies in `fillna` to improve performance (#10241) @galipremsagar
- Remove `std::numeric_limit` specializations for timestamp & durations (#10239) @codereport
- Optimize `DataFrame` creation across code-base (#10236) @galipremsagar
- Change pytest distribution algorithm and increase parallelism in CI (#10232) @galipremsagar
- Add environment variables for I/O thread pool and slice sizes (#10218) @vuule
- Add regex flags to strings findall functions (#10208) @davidwendt
- Update dask-cudf parquet tests to reflect upstream bugfixes to `_metadata` (#10206) @charlesbluca
- Remove unnecessary nunique function in Series. (#10205) @martinfalisse
- Refactor DataFrame tests. (#10204) @bdice
- Rewrites `column.__setitem__`, Use `boolean_mask_scatter` (#10202) @isVoid
- Java utilities to aid in accelerating aggregations on 128-bit types (#10201) @jlowe
- Fix docstrings alignment in `Frame` methods (#10199) @galipremsagar
- Fix cuco pair issue in hash join (#10195) @PointKernel
- Replace `dask` groupby `.index` usages with `.by` (#10193) @galipremsagar
- Add regex flags to strings extract function (#10192) @davidwendt
- Forward-merge branch-22.02 to branch-22.04 (#10191) @bdice
- Add CMake `install` rule for tests (#10190) @ajschmidt8
- Unpin `dask` & `distributed` (#10182) @galipremsagar
- Add comments to explain test validation (#10176) @galipremsagar
- Reduce warnings in pytest output (#10168) @bdice
- Some consolidation of indexed frame methods (#10167) @vyasr
- Refactor isin implementations (#10165) @vyasr
- Faster struct row comparator (#10164) @devavret
- Refactor groupby::get_groups. (#10161) @bdice
- Deprecate `decimal_cols_as_float` in ORC reader (C++ layer) (#10152) @vuule
- Replace `ccache` with `sccache` (#10146) @ajschmidt8
- Murmur3 hash kernel cleanup (#10143) @rwlee
- Deprecate `decimal_cols_as_float` in ORC reader (#10142) @galipremsagar
- Run pyupgrade 2.31.0. (#10141) @bdice
- Remove `drop_nan` from internal `IndexedFrame._drop_na_rows`. (#10140) @bdice
- Change cudf::strings::find_multiple to return a lists column (#10134) @davidwendt
- Update cmake-format script for branch 22.04. (#10132) @bdice
- Accept r-value references in convert_table_for_return(): (#10131) @mythrocks
- Remove the option to completely disable decimal128 columns in the ORC reader (#10127) @vuule
- Remove deprecated code (#10124) @vyasr
- Update gpu_utils.py to reflect current CUDA support. (#10113) @bdice
- Remove benchmarks suffix (#10112) @bdice
- Update cudf java binding version to 22.04.0-SNAPSHOT (#10084) @pxLi
- Remove unnecessary docker files. (#10069) @vyasr
- Limit benchmark iterations using environment variable (#10060) @karthikeyann
- Add timing chart for libcudf build metrics report page (#10038) @davidwendt
- JNI: Rewrite growBuffersAndRows to accelerate the HostColumnBuilder (#10025) @sperlingxx
- Reduce redundant code in CUDF JNI (#10019) @mythrocks
- Make snappy decompress check more efficient (#9995) @cheinger
- Remove deprecated method Series.set_index. (#9945) @bdice
- Implement a mixin for reductions (#9925) @vyasr
- JNI: Push back decimal utils from spark-rapids (#9907) @sperlingxx
- Add `assert_column_memory_*` (#9882) @isVoid
- Add CUDF_UNREACHABLE macro. (#9727) @bdice
- Upgrade `arrow` & `pyarrow` to `6.0.1` (#9686) @galipremsagar
Published by GPUtester almost 4 years ago
https://github.com/rapidsai/cudf - v22.02.00
🚨 Breaking Changes
- ORC writer API changes for granular statistics (#10058) @mythrocks
- `decimal128` Support for `to/from_arrow` (#9986) @codereport
- Remove deprecated method `one_hot_encoding` (#9977) @isVoid
- Remove str.subword_tokenize (#9968) @VibhuJawa
- Remove deprecated `method` parameter from `merge` and `join`. (#9944) @bdice
- Remove deprecated method DataFrame.hash_columns. (#9943) @bdice
- Remove deprecated method Series.hash_encode. (#9942) @bdice
- Refactoring ceil/round/floor code for datetime64 types (#9926) @mayankanand007
- Introduce `nan_as_null` parameter for `cudf.Index` (#9893) @galipremsagar
- Add regex_flags parameter to strings replace_re functions (#9878) @davidwendt
- Break tie for `top` categorical columns in `Series.describe` (#9867) @isVoid
- Add partitioning support in parquet writer (#9810) @devavret
- Move `drop_duplicates`, `drop_na`, `_gather`, `take` to IndexFrame and create their `_base_index` counterparts (#9807) @isVoid
- Raise temporary error for `decimal128` types in parquet reader (#9804) @galipremsagar
- Change default `dtype` of all nulls column from `float` to `object` (#9803) @galipremsagar
- Remove unused masked udf cython/c++ code (#9792) @brandon-b-miller
- Pick smallest decimal type with required precision in ORC reader (#9775) @vuule
- Add decimal128 support to Parquet reader and writer (#9765) @vuule
- Refactor TableTest assertion methods to a separate utility class (#9762) @jlowe
- Use cuFile direct device reads/writes by default in cuIO (#9722) @vuule
- Match pandas scalar result types in reductions (#9717) @brandon-b-miller
- Add parameters to control row group size in Parquet writer (#9677) @vuule
- Refactor bit counting APIs, introduce valid/null count functions, and split host/device side code for segmented counts. (#9588) @bdice
- Add support for `decimal128` in cudf python (#9533) @galipremsagar
- Implement `lists::index_of()` to find positions in list rows (#9510) @mythrocks
- Rewriting row/column conversions for Spark <-> cudf data conversions (#8444) @hyperbolic2346
🐛 Bug Fixes
- Add check for negative stripe index in ORC reader (#10074) @vuule
- Update Java tests to expect DECIMAL128 from Arrow (#10073) @jlowe
- Avoid index materialization when `DataFrame` is created with un-named `Series` objects (#10071) @galipremsagar
- fix gcc 11 compilation errors (#10067) @rongou
- Fix `columns` ordering issue in parquet reader (#10066) @galipremsagar
- Fix dataframe setitem with `ndarray` types (#10056) @galipremsagar
- Remove implicit copy due to conversion from cudf::size_type and size_t (#10045) @robertmaynard
- Include <optional> in headers that use std::optional (#10044) @robertmaynard
- Fix repr and concat of `StructColumn` (#10042) @galipremsagar
- Include row group level stats when writing ORC files (#10041) @vuule
- build.sh respects the `--build_metrics` and `--incl_cache_stats` flags (#10035) @robertmaynard
- Fix memory leaks in JNI native code. (#10029) @mythrocks
- Update JNI to use new arena mr constructor (#10027) @rongou
- Fix null check when comparing structs in `arg_min` operation of reduction/groupby (#10026) @ttnghia
- Wrap CI script shell variables in quotes to fix local testing. (#10018) @bdice
- cudftestutil no longer propagates compiler flags to external users (#10017) @robertmaynard
- Remove `CUDA_DEVICE_CALLABLE` macro usage (#10015) @hyperbolic2346
- Add missing list filling header in meta.yaml (#10007) @devavret
- Fix `conda` recipes for `custreamz` & `cudf_kafka` (#10003) @ajschmidt8
- Fix matching regex word-boundary (\b) in strings replace (#9997) @davidwendt
- Fix null check when comparing structs in `min` and `max` reduction/groupby operations (#9994) @ttnghia
- Fix octal pattern matching in regex string (#9993) @davidwendt
- `decimal128` Support for `to/from_arrow` (#9986) @codereport
- Fix groupby shift/diff/fill after selecting from a `GroupBy` (#9984) @shwina
- Fix the overflow problem of decimal rescale (#9966) @sperlingxx
- Use default value for decimal precision in parquet writer when not specified (#9963) @devavret
- Fix cudf java build error. (#9958) @firestarman
- Use gpuci_mamba_retry to install local artifacts. (#9951) @bdice
- Fix regression HostColumnVectorCore requiring native libs (#9948) @jlowe
- Rename aggregate_metadata in writer to fix name collision (#9938) @devavret
- Fixed issue with percentile_approx where output tdigests could have uninitialized data at the end. (#9931) @nvdbaranec
- Resolve racecheck errors in ORC kernels (#9916) @vuule
- Fix the java build after parquet partitioning support (#9908) @revans2
- Fix compilation of benchmark for parquet writer. (#9905) @bdice
- Fix a memcheck error in ORC writer (#9896) @vuule
- Introduce `nan_as_null` parameter for `cudf.Index` (#9893) @galipremsagar
- Fix fallback to sort aggregation for grouping only hash aggregate (#9891) @abellina
- Add zlib to cudfjni link when using static libcudf library dependency (#9890) @jlowe
- TimedeltaIndex constructor raises an AttributeError. (#9884) @skirui-source
- Fix cudf.Scalar string datetime construction (#9875) @brandon-b-miller
- Load libcufile.so with RTLD_NODELETE flag (#9872) @vuule
- Break tie for `top` categorical columns in `Series.describe` (#9867) @isVoid
- Fix null handling for structs `min` and `arg_min` in groupby, groupby scan, reduction, and inclusive_scan (#9864) @ttnghia
- Add one-level list encoding support in parquet reader (#9848) @PointKernel
- Fix an out-of-bounds read in validity copying in contiguous_split. (#9842) @nvdbaranec
- Fix join of MultiIndex to Index with one column and overlapping name. (#9830) @vyasr
- Fix caching in `Series.applymap` (#9821) @brandon-b-miller
- Enforce boolean `ascending` for dask-cudf `sort_values` (#9814) @charlesbluca
- Fix ORC writer crash with empty input columns (#9808) @vuule
- Change default `dtype` of all nulls column from `float` to `object` (#9803) @galipremsagar
- Load native dependencies when Java ColumnView is loaded (#9800) @jlowe
- Fix dtype-argument bug in dask_cudf read_csv (#9796) @rjzamora
- Fix overflow for min calculation in strings::from_timestamps (#9793) @revans2
- Fix memory error due to lambda return type deduction limitation (#9778) @karthikeyann
- Revert regex $/EOL end-of-string new-line special case handling (#9774) @davidwendt
- Fix missing streams (#9767) @karthikeyann
- Fix make_empty_scalar_like on list_type (#9759) @sperlingxx
- Update cmake and conda to 22.02 (#9746) @devavret
- Fix out-of-bounds memory write in decimal128-to-string conversion (#9740) @davidwendt
- Match pandas scalar result types in reductions (#9717) @brandon-b-miller
- Fix regex non-multiline EOL/$ matching strings ending with a new-line (#9715) @davidwendt
- Fixed build by adding more checks for int8, int16 (#9707) @razajafri
- Fix `null` handling when `boolean` dtype is passed (#9691) @galipremsagar
- Fix stream usage in `segmented_gather()` (#9679) @mythrocks
📖 Documentation
- Update `decimal` dtypes related docs entries (#10072) @galipremsagar
- Fix regex doc describing hexadecimal escape characters (#10009) @davidwendt
- Fix cudf compilation instructions. (#9956) @esoha-nvidia
- Fix see also links for IO APIs (#9895) @galipremsagar
- Fix build instructions for libcudf doxygen (#9837) @davidwendt
- Fix some doxygen warnings and add missing documentation (#9770) @karthikeyann
- update cuda version in local build (#9736) @karthikeyann
- Fix doxygen for enum types in libcudf (#9724) @davidwendt
- Spell check fixes (#9682) @karthikeyann
- Fix links in C++ Developer Guide. (#9675) @bdice
🚀 New Features
- Remove libcudacxx patch needed for nvcc 11.4 (#10057) @robertmaynard
- Allow CuPy 10 (#10048) @jakirkham
- Add in support for NULL_LOGICAL_AND and NULL_LOGICAL_OR binops (#10016) @revans2
- Add `groupby.transform` (only support for aggregations) (#10005) @shwina
- Add partitioning support to Parquet chunked writer (#10000) @devavret
- Add jni for sequences (#9972) @wbo4958
- Java bindings for mixed left, inner, and full joins (#9941) @jlowe
- Java bindings for JSON reader support (#9940) @wbo4958
- Enable transpose for string columns in cudf python (#9937) @galipremsagar
- Support structs for `cudf::contains` with column/scalar input (#9929) @ttnghia
- Implement mixed equality/conditional joins (#9917) @vyasr
- Add cudf::strings::extract_all API (#9909) @davidwendt
- Implement JNI for `cudf::scatter` APIs (#9903) @ttnghia
- JNI: Function to copy and set validity from bool column. (#9901) @mythrocks
- Add dictionary support to cudf::copy_if_else (#9887) @davidwendt
- add run_benchmarks target for running benchmarks with json output (#9879) @karthikeyann
- Add regex_flags parameter to strings replace_re functions (#9878) @davidwendt
- Add_suffix and add_prefix for DataFrames and Series (#9846) @mayankanand007
- Add JNI for `cudf::drop_duplicates` (#9841) @ttnghia
- Implement per-list sequence (#9839) @ttnghia
- adding `series.transpose` (#9835) @mayankanand007
- Adding support for `Series.autocorr` (#9833) @mayankanand007
- Support round operation on datetime64 datatypes (#9820) @mayankanand007
- Add partitioning support in parquet writer (#9810) @devavret
- Raise temporary error for `decimal128` types in parquet reader (#9804) @galipremsagar
- Add decimal128 support to Parquet reader and writer (#9765) @vuule
- Optimize `groupby::scan` (#9754) @PointKernel
- Add sample JNI API (#9728) @res-life
- Support `min` and `max` in inclusive scan for structs (#9725) @ttnghia
- Add `first` and `last` method to `IndexedFrame` (#9710) @isVoid
- Support `min` and `max` reduction for structs (#9697) @ttnghia
- Add parameters to control row group size in Parquet writer (#9677) @vuule
- Run compute-sanitizer in nightly build (#9641) @karthikeyann
- Implement Series.datetime.floor (#9571) @skirui-source
- ceil/floor for `DatetimeIndex` (#9554) @mayankanand007
- Add support for `decimal128` in cudf python (#9533) @galipremsagar
- Implement `lists::index_of()` to find positions in list rows (#9510) @mythrocks
- custreamz oauth callback for kafka (librdkafka) (#9486) @jdye64
- Add Pearson correlation for sort groupby (python) (#9166) @skirui-source
- Interchange dataframe protocol (#9071) @iskode
- Rewriting row/column conversions for Spark <-> cudf data conversions (#8444) @hyperbolic2346
🛠️ Improvements
- Prepare upload scripts for Python 3.7 removal (#10092) @Ethyling
- Simplify custreamz and cudf_kafka recipes files (#10065) @Ethyling
- ORC writer API changes for granular statistics (#10058) @mythrocks
- Remove python constraints in cutreamz and cudf_kafka recipes (#10052) @Ethyling
- Unpin `dask` and `distributed` in CI (#10028) @galipremsagar
- Add `_from_column_like_self` factory (#10022) @isVoid
- Replace custom CUDA bindings previously provided by RMM with official CUDA Python bindings (#10008) @shwina
- Use `cuda::std::is_arithmetic` in `cudf::is_numeric` trait. (#9996) @bdice
- Clean up CUDA stream use in cuIO (#9991) @vuule
- Use addressed-ordered first fit for the pinned memory pool (#9989) @rongou
- Add strings tests to transpose_test.cpp (#9985) @davidwendt
- Use gpuci_mamba_retry on Java CI. (#9983) @bdice
- Remove deprecated method `one_hot_encoding` (#9977) @isVoid
- Minor cleanup of unused Python functions (#9974) @vyasr
- Use new efficient partitioned parquet writing in cuDF (#9971) @devavret
- Remove str.subword_tokenize (#9968) @VibhuJawa
- Forward-merge branch-21.12 to branch-22.02 (#9947) @bdice
- Remove deprecated `method` parameter from `merge` and `join`. (#9944) @bdice
- Remove deprecated method DataFrame.hash_columns. (#9943) @bdice
- Remove deprecated method Series.hash_encode. (#9942) @bdice
- use ninja in java ci build (#9933) @rongou
- Add build-time publish step to cpu build script (#9927) @davidwendt
- Refactoring ceil/round/floor code for datetime64 types (#9926) @mayankanand007
- Remove various unused functions (#9922) @vyasr
- Raise in `query` if dtype is not supported (#9921) @brandon-b-miller
- Add missing imports tests (#9920) @Ethyling
- Spark Decimal128 hashing (#9919) @rwlee
- Replace `thrust/std::get` with structured bindings (#9915) @codereport
- Upgrade thrust version to 1.15 (#9912) @robertmaynard
- Remove conda envs for CUDA 11.0 and 11.2. (#9910) @bdice
- Return count of set bits from inplace_bitmask_and. (#9904) @bdice
- Use dynamic nullate for join hasher and equality comparator (#9902) @davidwendt
- Update ucx-py version on release using rvc (#9897) @Ethyling
- Remove `IncludeCategories` from `.clang-format` (#9876) @codereport
- Support statically linking CUDA runtime for Java bindings (#9873) @jlowe
- Add `clang-tidy` to libcudf (#9860) @codereport
- Remove deprecated methods from Java Table class (#9853) @jlowe
- Add test for map column metadata handling in ORC writer (#9852) @vuule
- Use pandas `to_offset` to parse frequency string in `date_range` (#9843) @isVoid
- add templated benchmark with fixture (#9838) @karthikeyann
- Use list of column inputs for `apply_boolean_mask` (#9832) @isVoid
- Added a few more tests for Decimal to String cast (#9818) @razajafri
- Run doctests. (#9815) @bdice
- Avoid overflow for fixed_point round (#9809) @sperlingxx
- Move `drop_duplicates`, `drop_na`, `_gather`, `take` to IndexFrame and create their `_base_index` counterparts (#9807) @isVoid
- Use vector factories for host-device copies. (#9806) @bdice
- Refactor host device macros (#9797) @vyasr
- Remove unused masked udf cython/c++ code (#9792) @brandon-b-miller
- Allow custom sort functions for dask-cudf `sort_values` (#9789) @charlesbluca
- Improve build time of libcudf iterator tests (#9788) @davidwendt
- Copy Java native dependencies directly into classpath (#9787) @jlowe
- Add decimal types to cuIO benchmarks (#9776) @vuule
- Pick smallest decimal type with required precision in ORC reader (#9775) @vuule
- Avoid overflow for `fixed_point` `cudf::cast` and performance optimization (#9772) @codereport
- Use CTAD with Thrust function objects (#9768) @codereport
- Refactor TableTest assertion methods to a separate utility class (#9762) @jlowe
- Use Java classloader to find test resources (#9760) @jlowe
- Allow cast decimal128 to string and add tests (#9756) @razajafri
- Load balance optimization for contiguous_split (#9755) @nvdbaranec
- Consolidate and improve `reset_index` (#9750) @isVoid
- Update to UCX-Py 0.24 (#9748) @pentschev
- Skip cufile tests in JNI build script (#9744) @pxLi
- Enable string to decimal 128 cast (#9742) @razajafri
- Use stop instead of stop_. (#9735) @bdice
- Forward-merge branch-21.12 to branch-22.02 (#9730) @bdice
- Improve cmake format script (#9723) @vyasr
- Use cuFile direct device reads/writes by default in cuIO (#9722) @vuule
- Add directory-partitioned data support to cudf.read_parquet (#9720) @rjzamora
- Use stream allocator adaptor for hash join table (#9704) @PointKernel
- Update check for inf/nan strings in libcudf float conversion to ignore case (#9694) @davidwendt
- Update cudf JNI to 22.02.0-SNAPSHOT (#9681) @pxLi
- Replace cudf's concurrent_ordered_map with cuco::static_map in semi/anti joins (#9666) @vyasr
- Some improvements to `parse_decimal` function and bindings for `is_fixed_point` (#9658) @razajafri
- Add utility to format ninja-log build times (#9631) @davidwendt
- Allow runtime has_nulls parameter for row operators (#9623) @davidwendt
- Use fsspec.parquet for improved read_parquet performance from remote storage (#9589) @rjzamora
- Refactor bit counting APIs, introduce valid/null count functions, and split host/device side code for segmented counts. (#9588) @bdice
- Use List of Columns as Input for `drop_nulls`, `gather` and `drop_duplicates` (#9558) @isVoid
- Simplify merge internals and reduce overhead (#9516) @vyasr
- Add `struct` generation support in datagenerator & fuzz tests (#9180) @galipremsagar
- Simplify write_csv by removing unnecessary writer/impl classes (#9089) @cwharris
- C++
Published by GPUtester about 4 years ago
https://github.com/rapidsai/cudf - v21.12.02
v21.12.02
- C++
Published by GPUtester about 4 years ago
https://github.com/rapidsai/cudf - v21.12.01
v21.12.01
- C++
Published by GPUtester about 4 years ago
https://github.com/rapidsai/cudf - v21.12.00
🚨 Breaking Changes
- Update `bitmask_and` and `bitmask_or` to return a pair of resulting mask and count of unset bits (#9616) @PointKernel
- Remove sizeof and standardize on memory_usage (#9544) @vyasr
- Add support for single-line regex anchors ^/$ in contains_re (#9482) @davidwendt
- Refactor sorting APIs (#9464) @vyasr
- Update Java nvcomp JNI bindings to nvcomp 2.x API (#9384) @jbrennan333
- Support Python UDFs written in terms of rows (#9343) @brandon-b-miller
- JNI: Support nested types in ORC writer (#9334) @firestarman
- Optionally nullify out-of-bounds indices in segmented_gather(). (#9318) @mythrocks
- Refactor cuIO timestamp processing with `cuda::std::chrono` (#9278) @PointKernel
- Various internal MultiIndex improvements (#9243) @vyasr
🐛 Bug Fixes
- Fix read_parquet bug for bytes input (#9669) @rjzamora
- Use `_gather` internal for `sort_*` (#9668) @isVoid
- Fix behavior of equals for non-DataFrame Frames and add tests. (#9653) @vyasr
- Dont recompute output size if it is already available (#9649) @abellina
- Fix read_parquet bug for extended dtypes from remote storage (#9638) @rjzamora
- add const when getting data from a JNI data wrapper (#9637) @wjxiz1992
- Fix debrotli issue on CUDA 11.5 (#9632) @vuule
- Use std::size_t when computing join output size (#9626) @jlowe
- Fix `usecols` parameter handling in `dask_cudf.read_csv` (#9618) @galipremsagar
- Add support for string `'nan', 'inf' & '-inf'` values while type-casting to `float` (#9613) @galipremsagar
- Avoid passing NativeFileDatasource to pyarrow in read_parquet (#9608) @rjzamora
- Fix test failure with cuda 11.5 in row_bit_count tests. (#9581) @nvdbaranec
- Correct _LIBCUDACXX_CUDACC_VER value computation (#9579) @robertmaynard
- Increase max RLE stream size estimate to avoid potential overflows (#9568) @vuule
- Fix edge case in tdigest scalar generation for groups containing all nulls. (#9551) @nvdbaranec
- Fix pytests failing in `cuda-11.5` environment (#9547) @galipremsagar
- compile libnvcomp with PTDS if requested (#9540) @jbrennan333
- Fix `segmented_gather()` for null LIST rows (#9537) @mythrocks
- Deprecate DataFrame.label_encoding, use private _label_encoding method internally. (#9535) @bdice
- Fix several test and benchmark issues related to bitmask allocations. (#9521) @nvdbaranec
- Fix for inserting duplicates in groupby result cache (#9508) @karthikeyann
- Fix mismatched types error in clip() when using non int64 numeric types (#9498) @davidwendt
- Match conda pinnings for style checks (revert part of #9412, #9433). (#9490) @bdice
- Make sure all dask-cudf supported aggs are handled in `_tree_node_agg` (#9487) @charlesbluca
- Resolve `hash_columns` `FutureWarning` in `dask_cudf` (#9481) @pentschev
- Add fixed point to AllTypes in libcudf unit tests (#9472) @karthikeyann
- Fix regex handling of embedded null characters (#9470) @davidwendt
- Fix memcheck error in copy-if-else (#9467) @davidwendt
- Fix bug in dask_cudf.read_parquet for index=False (#9453) @rjzamora
- Preserve the decimal scale when creating a default scalar (#9449) @revans2
- Push down parent nulls when flattening nested columns. (#9443) @mythrocks
- Fix memcheck error in gtest SegmentedGatherTest/GatherSliced (#9442) @davidwendt
- Revert "Fix quantile division / partition handling for dask-cudf sort… (#9438) @charlesbluca
- Allow int-like objects for the `decimals` argument in `round` (#9428) @shwina
- Fix stream compaction's `drop_duplicates` API to use stable sort (#9417) @ttnghia
- Skip Comparing Uniform Window Results in Var/std Tests (#9416) @isVoid
- Fix `StructColumn.to_pandas` type handling issues (#9388) @galipremsagar
- Correct issues in the build dir cudf-config.cmake (#9386) @robertmaynard
- Fix Java table partition test to account for non-deterministic ordering (#9385) @jlowe
- Fix timestamp truncation/overflow bugs in orc/parquet (#9382) @PointKernel
- Fix the crash in stats code (#9368) @devavret
- Make Series.hash_encode results reproducible. (#9366) @bdice
- Fix libcudf compile warnings on debug 11.4 build (#9360) @davidwendt
- Fail gracefully when compiling python UDFs that attempt to access columns with unsupported dtypes (#9359) @brandon-b-miller
- Set pass_filenames: false in mypy pre-commit configuration. (#9349) @bdice
- Fix cudf_assert in cudf::io::orc::gpu::gpuDecodeOrcColumnData (#9348) @davidwendt
- Fix memcheck error in groupby-tdigest get_scalar_minmax (#9339) @davidwendt
- Optimizations for `cudf.concat` when `axis=1` (#9333) @galipremsagar
- Use f-string in join helper warning message. (#9325) @bdice
- Avoid casting to list or struct dtypes in dask_cudf.read_parquet (#9314) @rjzamora
- Fix null count in statistics for parquet (#9303) @devavret
- Potential overflow of `decimal32` when casting to `int64_t` (#9287) @codereport
- Fix quantile division / partition handling for dask-cudf sort on null dataframes (#9259) @charlesbluca
- Updating cudf version also updates rapids cmake branch (#9249) @robertmaynard
- Implement `one_hot_encoding` in libcudf and bind to python (#9229) @isVoid
- BUG FIX: CSV Writer ignores the header parameter when no metadata is provided (#8740) @skirui-source
📖 Documentation
- Update Documentation to use `TYPED_TEST_SUITE` (#9654) @codereport
- Add dedicated page for `StringHandling` in python docs (#9624) @galipremsagar
- Update docstring of `DataFrame.merge` (#9572) @galipremsagar
- Use raw strings to avoid SyntaxErrors in parsed docstrings. (#9526) @bdice
- Add example to docstrings in `rolling.apply` (#9522) @isVoid
- Update help message to escape quotes in ./build.sh --cmake-args. (#9494) @bdice
- Improve Python docstring formatting. (#9493) @bdice
- Update table of I/O supported types (#9476) @vuule
- Document invalid regex patterns as undefined behavior (#9473) @davidwendt
- Miscellaneous documentation fixes to `cudf` (#9471) @galipremsagar
- Fix many documentation errors in libcudf. (#9355) @karthikeyann
- Fixing SubwordTokenizer docs issue (#9354) @mayankanand007
- Improved deprecation warnings. (#9347) @bdice
- doc reorder mr, stream to stream, mr (#9308) @karthikeyann
- Deprecate method parameters to DataFrame.join, DataFrame.merge. (#9291) @bdice
- Added deprecation warning for `.label_encoding()` (#9289) @mayankanand007
🚀 New Features
- Enable Series.divide and DataFrame.divide (#9630) @vyasr
- Update `bitmask_and` and `bitmask_or` to return a pair of resulting mask and count of unset bits (#9616) @PointKernel
- Add handling of mixed numeric types in `to_dlpack` (#9585) @galipremsagar
- Support re.Pattern object for pat arg in str.replace (#9573) @davidwendt
- Add JNI for `lists::drop_list_duplicates` with keys-values input column (#9553) @ttnghia
- Support structs column in `min`, `max`, `argmin` and `argmax` groupby aggregate() and scan() (#9545) @ttnghia
- Move libcudacxx to use `rapids_cpm` and use newer versions (#9539) @robertmaynard
- Add scan min/max support for chrono types to libcudf reduction-scan (not groupby scan) (#9518) @davidwendt
- Support `args=` in `apply` (#9514) @brandon-b-miller
- Add groupby scan min/max support for strings values (#9502) @davidwendt
- Add list output option to character_ngrams() function (#9499) @davidwendt
- More granular column selection in ORC reader (#9496) @vuule
- add min_periods, ddof to groupby covariance, & correlation aggregation (#9492) @karthikeyann
- Implement Series.datetime.floor (#9488) @skirui-source
- Enable linting of CMake files using pre-commit (#9484) @vyasr
- Add support for single-line regex anchors ^/$ in contains_re (#9482) @davidwendt
- Augment `order_by` to Accept a List of `null_precedence` (#9455) @isVoid
- Add format API for list column of strings (#9454) @davidwendt
- Enable Datetime/Timedelta dtypes in Masked UDFs (#9451) @brandon-b-miller
- Add cudf python groupby.diff (#9446) @karthikeyann
- Implement `lists::stable_sort_lists` for stable sorting of elements within each row of lists column (#9425) @ttnghia
- add ctest memcheck using cuda-sanitizer (#9414) @karthikeyann
- Support Unary Operations in Masked UDF (#9409) @isVoid
- Move Several Series Function to Frame (#9394) @isVoid
- MD5 Python hash API (#9390) @bdice
- Add cudf strings is_title API (#9380) @davidwendt
- Enable casting to int64, uint64, and double in AST code. (#9379) @vyasr
- Add support for writing ORC with map columns (#9369) @vuule
- extract_list_elements() with column_view indices (#9367) @mythrocks
- Reimplement `lists::drop_list_duplicates` for keys-values lists columns (#9345) @ttnghia
- Support Python UDFs written in terms of rows (#9343) @brandon-b-miller
- JNI: Support nested types in ORC writer (#9334) @firestarman
- Optionally nullify out-of-bounds indices in segmented_gather(). (#9318) @mythrocks
- Add shallow hash function and shallow equality comparison for column_view (#9312) @karthikeyann
- Add CudaMemoryBuffer for cudaMalloc memory using RMM cuda_memory_resource (#9311) @rongou
- Add parameters to control row index stride and stripe size in ORC writer (#9310) @vuule
- Add `na_position` param to dask-cudf `sort_values` (#9264) @charlesbluca
- Add `ascending` parameter for dask-cudf `sort_values` (#9250) @charlesbluca
- New array conversion methods (#9236) @vyasr
- Series `apply` method backed by masked UDFs (#9217) @brandon-b-miller
- Grouping by frequency and resampling (#9178) @shwina
- Pure-python masked UDFs (#9174) @brandon-b-miller
- Add Covariance, Pearson correlation for sort groupby (libcudf) (#9154) @karthikeyann
- Add `calendrical_month_sequence` in c++ and `date_range` in python (#8886) @shwina
🛠️ Improvements
- Followup to PR 9088 comments (#9659) @cwharris
- Update cuCollections to version that supports installed libcudacxx (#9633) @robertmaynard
- Add `11.5` dev.yml to `cudf` (#9617) @galipremsagar
- Add `xfail` for parquet reader `11.5` issue (#9612) @galipremsagar
- remove deprecated Rmm.initialize method (#9607) @rongou
- Use HostColumnVectorCore for child columns in JCudfSerialization.unpackHostColumnVectors (#9596) @sperlingxx
- Set RMM pool to a fixed size in JNI (#9583) @rongou
- Use nvCOMP for Snappy compression/decompression (#9582) @vuule
- Build CUDA version agnostic packages for dask-cudf (#9578) @Ethyling
- Fixed tests warning: "TYPED_TEST_CASE is deprecated, please use TYPED_TEST_SUITE" (#9574) @ttnghia
- Enable CMake format in CI and fix style (#9570) @vyasr
- Add NVTX Start/End Ranges to JNI (#9563) @abellina
- Add librdkafka and python-confluent-kafka to dev conda environments s… (#9562) @jdye64
- Add offsets_begin/end() to strings_column_view (#9559) @davidwendt
- remove alignment options for RMM jni (#9550) @rongou
- Add axis parameter passthrough to `DataFrame` and `Series` take for pandas API compatibility (#9549) @dantegd
- Remove sizeof and standardize on memory_usage (#9544) @vyasr
- Adds cudaProfilerStart/cudaProfilerStop in JNI api (#9543) @abellina
- Generalize comparison binary operations (#9542) @vyasr
- Expose APIs to wrap CUDA or RMM allocations with a Java device buffer instance (#9538) @jlowe
- Add scan sum support for duration types to libcudf (#9536) @davidwendt
- Force inlining to improve AST performance (#9530) @vyasr
- Generalize some more indexed frame methods (#9529) @vyasr
- Add Java bindings for rolling window stddev aggregation (#9527) @razajafri
- catch rmm::out_of_memory exceptions in jni (#9525) @rongou
- Add an overload of `make_empty_column` with `type_id` parameter (#9524) @ttnghia
- Accelerate conditional inner joins with larger right tables (#9523) @vyasr
- Initial pass of generalizing `decimal` support in `cudf` python layer (#9517) @galipremsagar
- Cleanup for flattening nested columns (#9509) @rwlee
- Enable running tests using RMM arena and async memory resources (#9506) @rongou
- Remove dependency on six. (#9495) @bdice
- Cleanup some libcudf strings gtests (#9489) @davidwendt
- Rename strings/array_tests.cu to strings/array_tests.cpp (#9480) @davidwendt
- Refactor sorting APIs (#9464) @vyasr
- Implement DataFrame.hash_values, deprecate DataFrame.hash_columns. (#9458) @bdice
- Deprecate Series.hash_encode. (#9457) @bdice
- Update `conda` recipes for Enhanced Compatibility effort (#9456) @ajschmidt8
- Small clean up to simplify column selection code in ORC reader (#9444) @vuule
- add missing stream to scalar.is_valid() wherever stream is available (#9436) @karthikeyann
- Adds Deprecation Warnings to `one_hot_encoding` and Implement `get_dummies` with Cython API (#9435) @isVoid
- Update pre-commit hook URLs. (#9433) @bdice
- Remove pyarrow import in `dask_cudf.io.parquet` (#9429) @charlesbluca
- Miscellaneous improvements for UDFs (#9422) @isVoid
- Use pre-commit for CI (#9412) @vyasr
- Update to UCX-Py 0.23 (#9407) @pentschev
- Expose OutOfBoundsPolicy in JNI for Table.gather (#9406) @abellina
- Improvements to tdigest aggregation code. (#9403) @nvdbaranec
- Add Java API to deserialize a table to host columns (#9402) @jlowe
- Frame copy to use class instead of type() (#9397) @madsbk
- Change all DeprecationWarnings to FutureWarning. (#9392) @bdice
- Update Java nvcomp JNI bindings to nvcomp 2.x API (#9384) @jbrennan333
- Add IndexedFrame class and move SingleColumnFrame to a separate module (#9378) @vyasr
- Support Arrow NativeFile and PythonFile for remote ORC storage (#9377) @rjzamora
- Use Arrow PythonFile for remote CSV storage (#9376) @rjzamora
- Add multi-threaded writing to GDS writes (#9372) @devavret
- Miscellaneous column cleanup (#9370) @vyasr
- Use single kernel to extract all groups in cudf::strings::extract (#9358) @davidwendt
- Consolidate binary ops into `Frame` (#9357) @isVoid
- Move rank scan implementations from scan_inclusive.cu to rank_scan.cu (#9351) @davidwendt
- Remove usage of deprecated thrust::host_space_tag. (#9350) @bdice
- Use Default Memory Resource for Temporaries in `reduction.cpp` (#9344) @isVoid
- Fix Cython compilation warnings. (#9327) @bdice
- Fix some unused variable warnings in libcudf (#9326) @davidwendt
- Use optional-iterator for copy-if-else kernel (#9324) @davidwendt
- Remove Table class (#9315) @vyasr
- Unpin `dask` and `distributed` in CI (#9307) @galipremsagar
- Add optional-iterator support to indexalator (#9306) @davidwendt
- Consolidate more methods in Frame (#9305) @vyasr
- Add Arrow-NativeFile and PythonFile support to read_parquet and read_csv in cudf (#9304) @rjzamora
- Pin mypy in .pre-commit-config.yaml to match conda environment pinning. (#9300) @bdice
- Use gather.hpp when gather-map exists in device memory (#9299) @davidwendt
- Fix Automerger for `Branch-21.12` from `branch-21.10` (#9285) @galipremsagar
- Refactor cuIO timestamp processing with `cuda::std::chrono` (#9278) @PointKernel
- Change strings copy_if_else to use optional-iterator instead of pair-iterator (#9266) @davidwendt
- Update cudf java bindings to 21.12.0-SNAPSHOT (#9248) @pxLi
- Various internal MultiIndex improvements (#9243) @vyasr
- Add detail interface for `split` and `slice(table_view)`, refactors both function with `host_span` (#9226) @isVoid
- Refactor MD5 implementation. (#9212) @bdice
- Update groupby result_cache to allow sharing intermediate results based on column_view instead of requests. (#9195) @karthikeyann
- Use nvcomp's snappy decompressor in avro reader (#9181) @devavret
- Add `isocalendar` API support (#9169) @marlenezw
- Simplify read_json by removing unnecessary reader/impl classes (#9088) @cwharris
- Simplify read_csv by removing unnecessary reader/impl classes (#9041) @cwharris
- Refactor hash join with cuCollections multimap (#8934) @PointKernel
- C++
Published by GPUtester about 4 years ago
https://github.com/rapidsai/cudf - v21.10.01
v21.10.01
- C++
Published by GPUtester over 4 years ago
https://github.com/rapidsai/cudf - v21.10.00
🚨 Breaking Changes
- Remove Cython APIs for table view generation (#9199) @vyasr
- Upgrade `pandas` version in `cudf` (#9147) @galipremsagar
- Make AST operators nullable (#9096) @vyasr
- Remove the option to pass data types as strings to `read_csv` and `read_json` (#9079) @vuule
- Update JNI java CSV APIs to not use deprecated API (#9066) @revans2
- Support additional format specifiers in from_timestamps (#9047) @davidwendt
- Expose expression base class publicly and simplify public AST API (#9045) @vyasr
- Add support for struct type in ORC writer (#9025) @vuule
- Remove aliases of various api.types APIs from utils.dtypes. (#9011) @vyasr
- Java bindings for conditional join output sizes (#9002) @jlowe
- Move compute_column API out of ast namespace (#8957) @vyasr
- `cudf.dtype` function (#8949) @shwina
- Refactor Frame reductions (#8944) @vyasr
- Add nested column selection to parquet reader (#8933) @devavret
- JNI Aggregation Type Changes (#8919) @revans2
- Add groupby_aggregation and groupby_scan_aggregation classes and force their usage. (#8906) @nvdbaranec
- Expand CSV and JSON reader APIs to accept `dtypes` as a vector or map of `data_type` objects (#8856) @vuule
- Change cudf docs theme to pydata theme (#8746) @galipremsagar
- Enable compiled binary ops in libcudf, python and java (#8741) @karthikeyann
- Make groupby transform-like op order match original data order (#8720) @isVoid
🐛 Bug Fixes
- `fixed_point` `cudf::groupby` for `mean` aggregation (#9296) @codereport
- Fix `interleave_columns` when the input string lists column having empty child column (#9292) @ttnghia
- Update nvcomp to include fixes for installation of headers (#9276) @devavret
- Fix Java column leak in testParquetWriteMap (#9271) @jlowe
- Fix call to thrust::reduce_by_key in argmin/argmax libcudf groupby (#9263) @davidwendt
- Fixing empty input to getMapValue crashing (#9262) @hyperbolic2346
- Fix duplicate names issue in `MultiIndex.deserialize` (#9258) @galipremsagar
- `Dataframe.sort_index` optimizations (#9238) @galipremsagar
- Temporarily disabling problematic test in parquet writer (#9230) @devavret
- Explicitly disable groupby on unsupported key types. (#9227) @mythrocks
- Fix `gather` for sliced input structs column (#9218) @ttnghia
- Fix JNI code for left semi and anti joins (#9207) @jlowe
- Only install thrust when using a non 'system' version (#9206) @robertmaynard
- Remove zlib from libcudf public CMake dependencies (#9204) @robertmaynard
- Fix out-of-bounds memory read in orc gpuEncodeOrcColumnData (#9196) @davidwendt
- Fix `gather()` for `STRUCT` inputs with no nulls in members. (#9194) @mythrocks
- get_cucollections properly uses rapids_cpm_find (#9189) @robertmaynard
- rapids-export correctly reference build code block and doc strings (#9186) @robertmaynard
- Fix logic while parsing the sum statistic for numerical orc columns (#9183) @ayushdg
- Add handling for nulls in `dask_cudf.sorting.quantile_divisions` (#9171) @charlesbluca
- Approximate overflow detection in ORC statistics (#9163) @vuule
- Use decimal precision metadata when reading from parquet files (#9162) @shwina
- Fix variable name in Java build script (#9161) @jlowe
- Import rapids-cmake modules using the correct cmake variable. (#9149) @robertmaynard
- Fix conditional joins with empty left table (#9146) @vyasr
- Fix joining on indexes with duplicate level names (#9137) @shwina
- Fixes missing child column name in dtype while reading ORC file. (#9134) @rgsl888prabhu
- Apply type metadata after column is slice-copied (#9131) @isVoid
- Fix a bug: inner_join_size return zero if build table is empty (#9128) @PointKernel
- Fix multi hive-partition parquet reading in dask-cudf (#9122) @rjzamora
- Support null literals in expressions (#9117) @vyasr
- Fix cudf::hash_join output size for struct joins (#9107) @jlowe
- Import fix (#9104) @shwina
- Fix cudf::strings::is_fixed_point checking of overflow for decimal32 (#9093) @davidwendt
- Fix branch_stack calculation in `row_bit_count()` (#9076) @mythrocks
- Fetch rapids-cmake to work around cuCollection cmake issue (#9075) @jlowe
- Fix compilation errors in groupby benchmarks. (#9072) @nvdbaranec
- Preserve float16 upscaling (#9069) @galipremsagar
- Fix memcheck read error in libcudf contiguous_split (#9067) @davidwendt
- Add support for reading ORC file with no row group index (#9060) @rgsl888prabhu
- Various multiindex related fixes (#9036) @shwina
- Avoid rebuilding cython in build.sh (#9034) @brandon-b-miller
- Add support for percentile dispatch in `dask_cudf` (#9031) @galipremsagar
- cudf resolve nvcc 11.0 compiler crashes during codegen (#9028) @robertmaynard
- Fetch correct grouping keys `agg` of dask groupby (#9022) @galipremsagar
- Allow `where()` to work with a Series and `other=cudf.NA` (#9019) @sarahyurick
- Use correct index when returning Series from `GroupBy.apply()` (#9016) @charlesbluca
- Fix `Dataframe` indexer setitem when array is passed (#9006) @galipremsagar
- Fix ORC reading of files with struct columns that have null values (#9005) @vuule
- Ensure JNI native libraries load when CompiledExpression loads (#8997) @jlowe
- Fix memory read error in get_dremel_data in page_enc.cu (#8995) @davidwendt
- Fix memory write error in get_list_child_to_list_row_mapping utility (#8994) @davidwendt
- Fix debug compile error for csv_test.cpp (#8981) @davidwendt
- Fix memory read/write error in concatenate_lists_ignore_null (#8978) @davidwendt
- Fix concatenation of `cudf.RangeIndex` (#8970) @galipremsagar
- Java conditional joins should not require matching column counts (#8955) @jlowe
- Fix concatenate empty structs (#8947) @sperlingxx
- Fix cuda-memcheck errors for some libcudf functions (#8941) @davidwendt
- Apply series name to result of `SeriesGroupby.apply()` (#8939) @charlesbluca
- `cdef packed_columns` as `cppclass` instead of `struct` (#8936) @charlesbluca
- Inserting a `cudf.NA` into a DataFrame (#8923) @sarahyurick
- Support casting with Pandas dtype aliases (#8920) @sarahyurick
- Allow `sort_values` to accept same `kind` values as Pandas (#8912) @sarahyurick
- Enable casting to pandas nullable dtypes (#8889) @brandon-b-miller
- Fix libcudf memory errors (#8884) @karthikeyann
- Throw KeyError when accessing field from struct with nonexistent key (#8880) @NV-jpt
- replace auto with auto& ref for cast<&> (#8866) @karthikeyann
- Add missing include<optional> in binops (#8864) @karthikeyann
- Fix `select_dtypes` to work when non-class dtypes present in dataframe (#8849) @sarahyurick
- Re-enable JSON tests (#8843) @vuule
- Support header with embedded delimiter in csv writer (#8798) @davidwendt
📖 Documentation
- Add IO docs page in `cudf` documentation (#9145) @galipremsagar
- use correct namespace in cuio code examples (#9037) @cwharris
- Restructuring `Contributing doc` (#9026) @iskode
- Update stable version in readme (#9008) @galipremsagar
- Add spans and more include guidelines to libcudf developer guide (#8931) @harrism
- Update Java build instructions to mention Arrow S3 and Docker (#8867) @jlowe
- List GDS-enabled formats in the docs (#8805) @vuule
- Change cudf docs theme to pydata theme (#8746) @galipremsagar
🚀 New Features
- Revert "Add shallow hash function and shallow equality comparison for column_view (#9185)" (#9283) @karthikeyann
- Align `DataFrame.apply` signature with pandas (#9275) @brandon-b-miller
- Add struct type support for `drop_list_duplicates` (#9202) @ttnghia
- support CUDA async memory resource in JNI (#9201) @rongou
- Add shallow hash function and shallow equality comparison for column_view (#9185) @karthikeyann
- Superimpose null masks for STRUCT columns. (#9144) @mythrocks
- Implemented bindings for `ceil` timestamp operation (#9141) @shaneding
- Adding MAP type support for ORC Reader (#9132) @rgsl888prabhu
- Implement `interleave_columns` for lists with arbitrary nested type (#9130) @ttnghia
- Add python bindings to fixed-size window and groupby `rolling.var`, `rolling.std` (#9097) @isVoid
- Make AST operators nullable (#9096) @vyasr
- Java bindings for approx_percentile (#9094) @andygrove
- Add `dseries.struct.explode` (#9086) @isVoid
- Add support for BaseIndexer in Rolling APIs (#9085) @galipremsagar
- Remove the option to pass data types as strings to `read_csv` and `read_json` (#9079) @vuule
- Add handling for nested dicts in dask-cudf groupby (#9054) @charlesbluca
- Added Series.dt.is_quarter_start and Series.dt.is_quarter_end (#9046) @TravisHester
- Support nested types for nth_element reduction (#9043) @sperlingxx
- Update sort groupby to use non-atomic operation (#9035) @karthikeyann
- Add support for struct type in ORC writer (#9025) @vuule
- Implement `interleave_columns` for structs columns (#9012) @ttnghia
- Add groupby first and last aggregations (#9004) @shwina
- Add `DecimalBaseColumn` and move `as_decimal_column` (#9001) @isVoid
- Python/Cython bindings for multibyte_split (#8998) @jdye64
- Support scalar `months` in `add_calendrical_months`, extends API to INT32 support (#8991) @isVoid
- Added Series.dt.is_month_end (#8989) @TravisHester
- Support for using tdigests to compute approximate percentiles. (#8983) @nvdbaranec
- Support "unflatten" of columns flattened via `flatten_nested_columns()`: (#8956) @mythrocks
- Implement timestamp ceil (#8942) @shaneding
- Add nested column selection to parquet reader (#8933) @devavret
- Expose conditional join size calculation (#8928) @vyasr
- Support Nulls in Timeseries Generator (#8925) @isVoid
- Avoid index equality check in `_CPackedColumns.from_py_table()` (#8917) @charlesbluca
- Add dot product binary op (#8909) @charlesbluca
- Expose `days_in_month` function in libcudf and add python bindings (#8892) @isVoid
- Series string repeat (#8882) @sarahyurick
- Python binding for quarters (#8862) @shaneding
- Expand CSV and JSON reader APIs to accept `dtypes` as a vector or map of `data_type` objects (#8856) @vuule
- Add Java bindings for AST transform (#8846) @jlowe
- Series datetime is_month_start (#8844) @sarahyurick
- Support bracket syntax for cudf::strings::replace_with_backrefs group index values (#8841) @davidwendt
- Support `VARIANCE` and `STD` aggregation in rolling op (#8809) @isVoid
- Add quarters to libcudf datetime (#8779) @shaneding
- Linear Interpolation of `nan`s via `cupy` (#8767) @brandon-b-miller
- Enable compiled binary ops in libcudf, python and java (#8741) @karthikeyann
- Make groupby transform-like op order match original data order (#8720) @isVoid
- multibyte_split (#8702) @cwharris
- Implement JNI for `strings:repeat_strings` that repeats each string separately by different numbers of times (#8572) @ttnghia
🛠️ Improvements
- Pin max `dask` and `distributed` versions to `2021.09.1` (#9286) @galipremsagar
- Optimized fsspec data transfer for remote file-systems (#9265) @rjzamora
- Skip dask-cudf tests on arm64 (#9252) @Ethyling
- Use nvcomp's snappy compressor in ORC writer (#9242) @devavret
- Only run imports tests on x86_64 (#9241) @Ethyling
- Remove unnecessary call to device_uvector::release() (#9237) @harrism
- Use nvcomp's snappy decompression in ORC reader (#9235) @devavret
- Add grouped_rolling test with STRUCT groupby keys. (#9228) @mythrocks
- Optimize `cudf.concat` for `axis=0` (#9222) @galipremsagar
- Fix some libcudf calls not passing the stream parameter (#9220) @davidwendt
- Add min and max bounds for random dataframe generator numeric types (#9211) @galipremsagar
- Improve performance of expression evaluation (#9210) @vyasr
- Misc optimizations in `cudf` (#9203) @galipremsagar
- Remove Cython APIs for table view generation (#9199) @vyasr
- Add JNI support for drop_list_duplicates (#9198) @revans2
- Update pandas versions in conda recipes and requirements.txt files (#9197) @galipremsagar
- Minor C++17 cleanup of `groupby.cu`: structured bindings, more concise lambda, etc (#9193) @codereport
- Explicit about bitwidth difference between cudf boolean and arrow boolean (#9192) @isVoid
- Remove sourceindex from MultiIndex (#9191) @vyasr
- Fix typo in the name of
cudf-testing-targets.cmake(#9190) @trxcllnt - Add support for single-digits in cudf::to_timestamps (#9173) @davidwendt
- Fix cufilejni build include path (#9168) @pxLi
- `dask_cudf` dispatch registering cleanup (#9160) @galipremsagar
- Remove unneeded stream/mr from a cudf::make_strings_column (#9148) @davidwendt
- Upgrade `pandas` version in `cudf` (#9147) @galipremsagar
- make data chunk reader return unique_ptr (#9129) @cwharris
- Add backend for `percentile_lookup` dispatch (#9118) @galipremsagar
- Refactor implementation of column setitem (#9110) @vyasr
- Fix compile warnings found using nvcc 11.4 (#9101) @davidwendt
- Update to UCX-Py 0.22 (#9099) @pentschev
- Simplify read_avro by removing unnecessary writer/impl classes (#9090) @cwharris
- Allowing %f in format to return nanoseconds (#9081) @marlenezw
- Java bindings for cudf::hash_join (#9080) @jlowe
- Remove stale code in `ColumnBase._fill` (#9078) @isVoid
- Add support for `get_group` in GroupBy (#9070) @galipremsagar
- Remove remaining "support" methods from DataFrame (#9068) @vyasr
- Update JNI java CSV APIs to not use deprecated API (#9066) @revans2
- Added method to remove null_masks if the column has no nulls (#9061) @razajafri
- Consolidate Several Series and Dataframe Methods (#9059) @isVoid
- Remove usage of string based `set_dtypes` for `csv` & `json` readers (#9049) @galipremsagar
- Remove some debug print statements from gtests (#9048) @davidwendt
- Support additional format specifiers in from_timestamps (#9047) @davidwendt
- Expose expression base class publicly and simplify public AST API (#9045) @vyasr
- move filepath and mmap logic out of json/csv up to functions.cpp (#9040) @cwharris
- Refactor Index hierarchy (#9039) @vyasr
- cudf now leverages rapids-cmake to reduce CMake boilerplate (#9030) @robertmaynard
- Add support for `STRUCT` input to `groupby` (#9024) @mythrocks
- Refactor Frame scans (#9021) @vyasr
- Remove duplicate `set_categories` code (#9018) @isVoid
- Map support for ParquetWriter (#9013) @razajafri
- Remove aliases of various api.types APIs from utils.dtypes. (#9011) @vyasr
- Java bindings for conditional join output sizes (#9002) @jlowe
- Remove copyconstruct factory (#8999) @vyasr
- ENH Allow arbitrary CMake config options in build.sh (#8996) @dillon-cullinan
- A small optimization for JNI copy column view to column vector (#8985) @revans2
- Fix nvcc warnings in ORC writer (#8975) @devavret
- Support nested structs in rank and dense rank (#8962) @rwlee
- Move compute_column API out of ast namespace (#8957) @vyasr
- Series datetime is_year_end and is_year_start (#8954) @marlenezw
- Make Java AstNode public (#8953) @jlowe
- Replace allocate with device_uvector for subword_tokenize internal tables (#8952) @davidwendt
- `cudf.dtype` function (#8949) @shwina
- Refactor Frame reductions (#8944) @vyasr
- Add deprecation warning for `Series.set_mask` API (#8943) @galipremsagar
- Move AST evaluator into a separate header (#8930) @vyasr
- JNI Aggregation Type Changes (#8919) @revans2
- Move template parameter to function parameter in cudf::detail::left_semi_anti_join (#8914) @davidwendt
- Upgrade `arrow` & `pyarrow` to `5.0.0` (#8908) @galipremsagar
- Add groupby_aggregation and groupby_scan_aggregation classes and force their usage. (#8906) @nvdbaranec
- Move `structs_column_tests.cu` to `.cpp`. (#8902) @mythrocks
- Add stream and memory-resource parameters to struct-scalar copy ctor (#8901) @davidwendt
- Combine linearizer and ast_plan (#8900) @vyasr
- Add Java bindings for conditional join gather maps (#8888) @jlowe
- Remove max version pin for `dask` & `distributed` on development branch (#8881) @galipremsagar
- fix cufilejni build w/ c++17 (#8877) @pxLi
- Add struct accessor to dask-cudf (#8874) @NV-jpt
- Migrate dask-cudf CudfEngine to leverage ArrowDatasetEngine (#8871) @rjzamora
- Add JNI for extract_quarter, add_calendrical_months, and is_leap_year (#8863) @revans2
- Change cudf::scalar copy and move constructors to protected (#8857) @davidwendt
- Replace `is_same<>::value` with `is_same_v<>` (#8852) @codereport
- Add min `pytorch` version to `importorskip` in pytest (#8851) @galipremsagar
- Java bindings for regex replace (#8847) @jlowe
- Remove make strings children with null mask (#8830) @davidwendt
- Refactor conditional joins (#8815) @vyasr
- Small cleanup (unused headers / commented code removals) (#8799) @codereport
- ENH Replace gpuci_conda_retry with gpuci_mamba_retry (#8770) @dillon-cullinan
- Update cudf java bindings to 21.10.0-SNAPSHOT (#8765) @pxLi
- Refactor and improve join benchmarks with nvbench (#8734) @PointKernel
- Refactor Python factories and remove usage of Table for libcudf output handling (#8687) @vyasr
- Optimize URL Decoding (#8622) @gaohao95
- Parquet writer dictionary encoding refactor (#8476) @devavret
- Use nvcomp's snappy decompression in parquet reader (#8252) @devavret
- Use nvcomp's snappy compressor in parquet writer (#8229) @devavret
Published by GPUtester over 4 years ago
https://github.com/rapidsai/cudf - v21.08.03
v21.08.03
Published by GPUtester over 4 years ago
https://github.com/rapidsai/cudf - v21.08.02
v21.08.02
Published by GPUtester over 4 years ago
https://github.com/rapidsai/cudf - v21.08.01
v21.08.01
Published by GPUtester over 4 years ago
https://github.com/rapidsai/cudf - v21.08.00
🚨 Breaking Changes
- Fix a crash in pack() when being handed tables with no columns. (#8697) @nvdbaranec
- Remove unused cudf::strings::create_offsets (#8663) @davidwendt
- Add delimiter parameter to cudf::strings::capitalize() (#8620) @davidwendt
- Change default datetime index resolution to ns to match pandas (#8611) @vyasr
- Add sequence_type parameter to cudf::strings::title function (#8602) @davidwendt
- Add `strings::repeat_strings` API that can repeat each string a different number of times (#8561) @ttnghia
- String-to-boolean conversion is different from Pandas (#8549) @skirui-source
- Add accurate hash join size functions (#8453) @PointKernel
- Expose a Decimal32Dtype in cuDF Python (#8438) @skirui-source
- Update dask make_meta changes to be compatible with dask upstream (#8426) @galipremsagar
- Adapt `cudf::scalar` classes to changes in `rmm::device_scalar` (#8411) @harrism
- Remove special Index class from the general index class hierarchy (#8309) @vyasr
- Add first-class dtype utilities (#8308) @vyasr
- ORC - Support reading multiple orc files/buffers in a single operation (#8142) @jdye64
- Upgrade arrow to 4.0.1 (#7495) @galipremsagar
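As an informal illustration of the per-row `strings::repeat_strings` entry in the list above, the following pure-Python sketch mimics the behaviour the entry describes: each string is repeated by its own count. It does not call libcudf. The function name `repeat_strings_per_row` is hypothetical, and mapping a null in either input to a null output (and a non-positive count to an empty string) is an assumption about the semantics rather than something these notes state.

```python
from typing import List, Optional


def repeat_strings_per_row(
    strings: List[Optional[str]], counts: List[Optional[int]]
) -> List[Optional[str]]:
    """Repeat strings[i] counts[i] times (hypothetical reference model, not the libcudf API)."""
    if len(strings) != len(counts):
        raise ValueError("strings and counts must have the same length")
    out: List[Optional[str]] = []
    for s, n in zip(strings, counts):
        if s is None or n is None:
            # Assumption: a null in either input yields a null output row.
            out.append(None)
        else:
            # Assumption: non-positive counts produce an empty string.
            out.append(s * max(n, 0))
    return out


result = repeat_strings_per_row(["a", "bc", None, "d"], [3, 1, 2, 0])
print(result)
```

The scalar-count form that existed before this change corresponds to passing the same count for every row.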
🐛 Bug Fixes
- Fix `contains` check in string column (#8834) @galipremsagar
- Remove unused variable from `row_bit_count_test`. (#8829) @mythrocks
- Fixes issue with null struct columns in ORC reader (#8819) @rgsl888prabhu
- Set CMake vars for python/parquet support in libarrow builds (#8808) @vyasr
- Handle empty child columns in row_bit_count() (#8791) @mythrocks
- Revert "Remove cudf unneeded build time requirement of the cuda driver" (#8784) @robertmaynard
- Fix isort error in utils.pyx (#8771) @charlesbluca
- Handle sliced struct/list columns properly in concatenate() bounds checking. (#8760) @nvdbaranec
- Fix issues with `_CPackedColumns.serialize()` handling of host and device data (#8759) @charlesbluca
- Fix issues with `MultiIndex` in `dropna`, `stack` & `reset_index` (#8753) @galipremsagar
- Write pandas extension types to parquet file metadata (#8749) @devavret
- Fix `where` to handle `DataFrame` & `Series` input combination (#8747) @galipremsagar
- Fix `replace` to handle null values correctly (#8744) @galipremsagar
- Handle sliced structs properly in pack/contiguous_split. (#8739) @nvdbaranec
- Fix issue in slice() where columns with a positive offset were computing null counts incorrectly. (#8738) @nvdbaranec
- Fix `cudf.Series` constructor to handle list of sequences (#8735) @galipremsagar
- Fix min/max sorted groupby aggregation on string column with nulls (argmin, argmax sentinel value missing on nulls) (#8731) @karthikeyann
- Fix orc reader assert on create data_type in debug (#8706) @davidwendt
- Fix min/max inclusive cudf::scan for strings column (#8705) @davidwendt
- JNI: Fix driver version assertion logic in testGetCudaRuntimeInfo (#8701) @sperlingxx
- Adding fix for skip_rows and crash in orc reader (#8700) @rgsl888prabhu
- Bug fix: `replace_nulls_policy` functor not returning correct indices for gathermap (#8699) @isVoid
- Fix a crash in pack() when being handed tables with no columns. (#8697) @nvdbaranec
- Add post-processing steps to `dask_cudf.groupby.CudfSeriesGroupby.aggregate` (#8694) @charlesbluca
- JNI build no longer looks for Arrow in conda environment (#8686) @jlowe
- Handle arbitrarily different data in null list column rows when checking for equivalency. (#8666) @nvdbaranec
- Add ConfigureNVBench to avoid concurrent main() entry points (#8662) @PointKernel
- Pin `*arrow` to use `*cuda` in `run` (#8651) @jakirkham
- Add proper support for tolerances in testing methods. (#8649) @vyasr
- Support multi-char case conversion in capitalize function (#8647) @davidwendt
- Fix repeated mangled names in read_csv with duplicate column names (#8645) @karthikeyann
- Temporarily disable libcudf example build tests (#8642) @isVoid
- Use conda-sourced cudf artifacts for libcudf example in CI (#8638) @isVoid
- Ensure dev environment uses Arrow GPU packages (#8637) @charlesbluca
- Fix bug that columns only initialized once when specified `columns` and `index` in dataframe ctor (#8628) @isVoid
- Propagate **kwargs through to as_*_column methods (#8618) @shwina
- Fix orc_reader_benchmark.cpp compile error (#8609) @davidwendt
- Fix missed renumbering of Aggregation values (#8600) @revans2
- Update cmake to 3.20.5 in the Java Docker image (#8593) @NvTimLiu
- Fix bug in replace_with_backrefs when group has greedy quantifier (#8575) @davidwendt
- Apply metadata to keys before returning in `Frame._encode` (#8560) @charlesbluca
- Fix for strings containing special JSON characters in get_json_object(). (#8556) @nvdbaranec
- Fix debug compile error in gather_struct_tests.cpp (#8554) @davidwendt
- String-to-boolean conversion is different from Pandas (#8549) @skirui-source
- Fix `__repr__` output with `display.max_rows` is `None` (#8547) @galipremsagar
- Fix size passed to column constructors in _with_type_metadata (#8539) @shwina
- Properly retrieve last column when `-1` is specified for column index (#8529) @isVoid
- Fix importing `apply` from `dask` (#8517) @galipremsagar
- Fix offset of the string dictionary length stream (#8515) @vuule
- Fix double counting of selected columns in CSV reader (#8508) @ochan1
- Incorrect map size in scatter_to_gather corrupts struct columns (#8507) @gerashegalov
- replace_nulls properly propagates memory resource to gather calls (#8500) @robertmaynard
- Disallow groupby aggs for `StructColumns` (#8499) @charlesbluca
- Fixes out-of-bounds access for small files in unzip (#8498) @elstehle
- Adding support for writing empty dataframe (#8490) @shaneding
- Fix exclusive scan when including nulls and improve testing (#8478) @harrism
- Add workaround for crash in libcudf debug build using output_indexalator in thrust::lower_bound (#8432) @davidwendt
- Install only the same Thrust files that Thrust itself installs (#8420) @robertmaynard
- Add nightly version for ucx-py in ci script (#8419) @galipremsagar
- Fix null_equality config of rolling_collect_set (#8415) @sperlingxx
- CollectSetAggregation: implement RollingAggregation interface (#8406) @sperlingxx
- Handle pre-sliced nested columns in contiguous_split. (#8391) @nvdbaranec
- Fix bitmask_tests.cpp host accessing device memory (#8370) @davidwendt
- Fix concurrent_unordered_map to prevent accessing padding bits in pair_type (#8348) @davidwendt
- BUG FIX: Raise appropriate strings error when concatenating strings column (#8290) @skirui-source
- Make gpuCI and pre-commit style configurations consistent (#8215) @charlesbluca
- Add collect list to dask-cudf groupby aggregations (#8045) @charlesbluca
📖 Documentation
- Update Python UDFs notebook (#8810) @brandon-b-miller
- Fix dask.dataframe API docs links after reorg (#8772) @jsignell
- Fix instructions for running cuDF/dask-cuDF tests in CONTRIBUTING.md (#8724) @shwina
- Translate Markdown documentation to rST and remove recommonmark (#8698) @vyasr
- Fixed spelling mistakes in libcudf documentation (#8664) @karthikeyann
- Custom Sphinx Extension: `PandasCompat` (#8643) @isVoid
- Fix README.md (#8535) @ajschmidt8
- Change namespace contains_nulls to struct (#8523) @davidwendt
- Add info about NVTX ranges to dev guide (#8461) @jrhemstad
- Fixed documentation bug in groupby agg method (#8325) @ahmet-uyar
🚀 New Features
- Fix concatenating structs (#8811) @shaneding
- Implement JNI for groupby aggregations `M2` and `MERGE_M2` (#8763) @ttnghia
- Bump `isort` to `5.6.4` and remove `isort` overrides made for 5.0.7 (#8755) @charlesbluca
- Implement `__setitem__` for `StructColumn` (#8737) @shaneding
- Add `is_leap_year` to `DateTimeProperties` and `DatetimeIndex` (#8736) @isVoid
- Add `struct.explode()` method (#8729) @shwina
- Add `DataFrame.to_struct()` method to convert a DataFrame to a struct Series (#8728) @shwina
- Add support for list type in ORC writer (#8723) @vuule
- Fix slicing from struct columns and accessing struct columns (#8719) @shaneding
- Add `datetime::is_leap_year` (#8711) @isVoid
- Accessing struct columns from `dask_cudf` (#8675) @shaneding
- Added pct_change to Series (#8650) @TravisHester
- Add strings support to cudf::shift function (#8648) @davidwendt
- Support Scatter `struct_scalar` (#8630) @isVoid
- Struct scalar from host dictionary (#8629) @shaneding
- Add dayofyear and day_of_year to Series, DatetimeColumn, and DatetimeIndex (#8626) @beckernick
- JNI support for capitalize (#8624) @firestarman
- Add delimiter parameter to cudf::strings::capitalize() (#8620) @davidwendt
- Add NVBench in CMake (#8619) @PointKernel
- Change default datetime index resolution to ns to match pandas (#8611) @vyasr
- ListColumn `__setitem__` (#8606) @brandon-b-miller
- Implement groupby aggregations `M2` and `MERGE_M2` (#8605) @ttnghia
- Add sequence_type parameter to cudf::strings::title function (#8602) @davidwendt
- Adding support for list and struct type in ORC Reader (#8599) @rgsl888prabhu
- Benchmark for `strings::repeat_strings` APIs (#8589) @ttnghia
- Nested scalar support for copy if else (#8588) @gerashegalov
- User specified decimal columns to float64 (#8587) @jdye64
- Add `get_element` for struct column (#8578) @isVoid
- Python changes for adding `__getitem__` for `struct` (#8577) @shaneding
- Add `strings::repeat_strings` API that can repeat each string a different number of times (#8561) @ttnghia
- Refactor `tests/iterator_utilities.hpp` functions (#8540) @ttnghia
- Support MERGE_LISTS and MERGE_SETS in Java package (#8516) @sperlingxx
- Decimal support csv reader (#8511) @elstehle
- Add column type tests (#8505) @isVoid
- Warn when downscaling decimal columns (#8492) @ChrisJar
- Add JNI for `strings::repeat_strings` (#8491) @ttnghia
- Add `Index.get_loc` for Numerical, String Index support (#8489) @isVoid
- Expose half_up rounding in cuDF (#8477) @shwina
- Java APIs to fetch CUDA runtime info (#8465) @sperlingxx
- Add `str.edit_distance_matrix` (#8463) @isVoid
- Support constructing `cudf.Scalar` objects from host side lists (#8459) @brandon-b-miller
- Add accurate hash join size functions (#8453) @PointKernel
- Add cudf::strings::integer_to_hex convert API (#8450) @davidwendt
- Create objects from iterables that contain cudf.NA (#8442) @brandon-b-miller
- JNI bindings for sort_lists (#8439) @sperlingxx
- Expose a Decimal32Dtype in cuDF Python (#8438) @skirui-source
- Replace `all_null()` and `all_valid()` by `iterator_all_nulls()` and `iterator_no_null()` in tests (#8437) @ttnghia
- Implement groupby `MERGE_LISTS` and `MERGE_SETS` aggregates (#8436) @ttnghia
- Add public libcudf match_dictionaries API (#8429) @davidwendt
- Add move constructors for `string_scalar` and `struct_scalar` (#8428) @ttnghia
- Implement `strings::repeat_strings` (#8423) @ttnghia
- STRUCT column support for cudf::merge. (#8422) @nvdbaranec
- Implement reverse in libcudf (#8410) @shaneding
- Support multiple input files/buffers for read_json (#8403) @jdye64
- Improve test coverage for struct search (#8396) @ttnghia
- Add `groupby.fillna` (#8362) @isVoid
- Enable AST-based joining (#8214) @vyasr
- Generalized null support in user defined functions (#8213) @brandon-b-miller
- Add compiled binary operation (#8192) @karthikeyann
- Implement `.describe()` for `DataFrameGroupBy` (#8179) @skirui-source
- ORC - Support reading multiple orc files/buffers in a single operation (#8142) @jdye64
- Add Python bindings for `lists::concatenate_list_elements` and expose them as `.list.concat()` (#8006) @shwina
- Use Arrow URI FileSystem backed instance to retrieve remote files (#7709) @jdye64
- Example to build custom application and link to libcudf (#7671) @isVoid
- Upgrade arrow to 4.0.1 (#7495) @galipremsagar
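The `is_leap_year` entries in the list above add a per-element leap-year test for datetime data. As a rough, CPU-only sketch of the rule such a test presumably applies (the Gregorian rule, which is an assumption here since these notes do not spell it out), the plain-Python helper below checks a few years and cross-checks itself against the standard library. The helper name is illustrative and is not the cuDF or libcudf API.

```python
import calendar


def is_leap_year(year: int) -> bool:
    """Gregorian rule: divisible by 4, except century years not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


years = [1900, 2000, 2020, 2021]
flags = [is_leap_year(y) for y in years]
print(dict(zip(years, flags)))

# Cross-check the sketch against the standard library over a wide range.
agrees_with_stdlib = all(
    is_leap_year(y) == calendar.isleap(y) for y in range(1600, 2401)
)
print(agrees_with_stdlib)
```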
🛠️ Improvements
- Provide a better error message when `CUDA::cuda_driver` not found (#8794) @robertmaynard
- Remove anonymous namespace from null_mask.cuh (#8786) @nvdbaranec
- Allow cudf to be built without libcuda.so existing (#8751) @robertmaynard
- Pin `mimesis` to `<4.1` (#8745) @galipremsagar
- Update `conda` environment name for CI (#8692) @ajschmidt8
- Remove flatbuffers dependency (#8671) @Ethyling
- Add options to build Arrow with Python and Parquet support (#8670) @trxcllnt
- Remove unused cudf::strings::create_offsets (#8663) @davidwendt
- Update GDS lib version to 1.0.0 (#8654) @pxLi
- Support for groupby/scan rank and dense_rank aggregations (#8652) @rwlee
- Fix usage of deprecated arrow ipc API (#8632) @revans2
- Use absolute imports in `cudf` (#8631) @galipremsagar
- ENH Add Java CI build script (#8627) @dillon-cullinan
- Add DeprecationWarning to `ser.str.subword_tokenize` (#8603) @VibhuJawa
- Rewrite binary operations for improved performance and additional type support (#8598) @vyasr
- Fix `mypy` errors surfacing because of `numpy-1.21.0` (#8595) @galipremsagar
- Remove unneeded includes from cudf::string_view headers (#8594) @davidwendt
- Use cmake 3.20.1 as it is now required by rmm (#8586) @robertmaynard
- Remove device debug symbols from cmake CUDF_CUDA_FLAGS (#8584) @davidwendt
- Dask-CuDF: use default Dask Dataframe optimizer (#8581) @madsbk
- Remove checking if an unsigned value is less than zero (#8579) @robertmaynard
- Remove strings_count parameter from cudf::strings::detail::create_chars_child_column (#8576) @davidwendt
- Make `cudf.api.types` imports consistent (#8571) @galipremsagar
- Modernize libcudf basic example CMakeFile; updates CI build tests (#8568) @isVoid
- Rename concatenate_tests.cu to .cpp (#8555) @davidwendt
- enable window lead/lag test on struct (#8548) @wbo4958
- Add Java methods to split and write column views (#8546) @razajafri
- Small cleanup (#8534) @codereport
- Unpin `dask` version in CI (#8533) @galipremsagar
- Added optional flag for building Arrow with S3 filesystem support (#8531) @jdye64
- Minor clean up of various internal column and frame utilities (#8528) @vyasr
- Rename some copying_test source files .cu to .cpp (#8527) @davidwendt
- Correct the last warnings and issues when using newer cuda versions (#8525) @robertmaynard
- Correct unused parameter warnings in transform and unary ops (#8521) @robertmaynard
- Correct unused parameter warnings in string algorithms (#8509) @robertmaynard
- Add in JNI APIs for scan, replace_nulls, groupby.scan, and groupby.replace_nulls (#8503) @revans2
- Fix `21.08` forward-merge conflicts (#8502) @ajschmidt8
- Fix Cython formatting command in Contributing.md. (#8496) @marlenezw
- Bug/correct unused parameters in reshape and text (#8495) @robertmaynard
- Correct unused parameter warnings in partitioning and stream compact (#8494) @robertmaynard
- Correct unused parameter warnings in labelling and list algorithms (#8493) @robertmaynard
- Refactor index construction (#8485) @vyasr
- Correct unused parameter warnings in replace algorithms (#8483) @robertmaynard
- Correct unused parameter warnings in reduction algorithms (#8481) @robertmaynard
- Correct unused parameter warnings in io algorithms (#8480) @robertmaynard
- Correct unused parameter warnings in interop algorithms (#8479) @robertmaynard
- Correct unused parameter warnings in filling algorithms (#8468) @robertmaynard
- Correct unused parameter warnings in groupby (#8467) @robertmaynard
- use libcu++ time_point as timestamp (#8466) @karthikeyann
- Modify reprog_device::extract to return groups in a single pass (#8460) @davidwendt
- Update minimum Dask requirement to 2021.6.0 (#8458) @pentschev
- Fix failures when performing binary operations on DataFrames with empty columns (#8452) @ChrisJar
- Fix conflicts in `8447` (#8448) @ajschmidt8
- Add serialization methods for `List` and `StructDtype` (#8441) @charlesbluca
- Replace make_empty_strings_column with make_empty_column (#8435) @davidwendt
- JNI bindings for get_element (#8433) @revans2
- Update dask make_meta changes to be compatible with dask upstream (#8426) @galipremsagar
- Unpin dask version on CI (#8425) @galipremsagar
- Add benchmark for strings/fixed_point convert APIs (#8417) @davidwendt
- Adapt `cudf::scalar` classes to changes in `rmm::device_scalar` (#8411) @harrism
- Add benchmark for strings/integers convert APIs (#8402) @davidwendt
- Enable multi-file partitioning in dask_cudf.read_parquet (#8393) @rjzamora
- Correct unused parameter warnings in rolling algorithms (#8390) @robertmaynard
- Correct unused parameters in column round and search (#8389) @robertmaynard
- Add functionality to apply `Dtype` metadata to `ColumnBase` (#8373) @charlesbluca
- Refactor setting stack size in regex code (#8358) @davidwendt
- Update Java bindings to 21.08-SNAPSHOT (#8344) @pxLi
- Replace remaining uses of device_vector (#8343) @harrism
- Statically link libnvcomp into libcudfjni (#8334) @jlowe
- Resolve auto merge conflicts for Branch 21.08 from branch 21.06 (#8329) @galipremsagar
- Minor code refactor for sorted_order (#8326) @wbo4958
- Remove special Index class from the general index class hierarchy (#8309) @vyasr
- Add first-class dtype utilities (#8308) @vyasr
- Add option to link Java bindings with Arrow dynamically (#8307) @jlowe
- Refactor ColumnMethods and its subclasses to remove `column` argument and require `parent` argument (#8306) @shwina
- Refactor `scatter` for list columns (#8255) @isVoid
- Expose pack/unpack API to Python (#8153) @charlesbluca
- Adding cudf.cut method (#8002) @marlenezw
- Optimize string gather performance for large strings (#7980) @gaohao95
- Add peak memory usage tracking to cuIO benchmarks (#7770) @devavret
- Updating Clang Version to 11.0.0 (#6695) @codereport
Published by GPUtester over 4 years ago
https://github.com/rapidsai/cudf - v21.06.00
🚨 Breaking Changes
- Add support for `make_meta_obj` dispatch in `dask-cudf` (#8342) @galipremsagar
- Add separator-on-null parameter to strings concatenate APIs (#8282) @davidwendt
- Introduce a common parent class for NumericalColumn and DecimalColumn (#8278) @vyasr
- Update ORC statistics API to use C++17 standard library (#8241) @vuule
- Preserve column hierarchy when getting NULL row from `LIST` column (#8206) @isVoid
- `Groupby.shift` c++ API refactor and python binding (#8131) @isVoid
🐛 Bug Fixes
- Fix struct flattening to add a validity column only when the input column has null element (#8374) @ttnghia
- Compilation fix: Remove redefinition for `std::is_same_v()` (#8369) @mythrocks
- Add backward compatibility for `dask-cudf` to work with other versions of `dask` (#8368) @galipremsagar
- Handle empty results with nested types in copy_if_else (#8359) @nvdbaranec
- Handle nested column types properly for empty parquet files. (#8350) @nvdbaranec
- Raise error when unsupported arguments are passed to `dask_cudf.DataFrame.sort_values` (#8349) @galipremsagar
- Raise `NotImplementedError` for axis=1 in `rank` (#8347) @galipremsagar
- Add support for `make_meta_obj` dispatch in `dask-cudf` (#8342) @galipremsagar
- Update Java string concatenate test for single column (#8330) @tgravescs
- Use empty_like in scatter (#8314) @revans2
- Fix concatenate_lists_ignore_null on rows of all_nulls (#8312) @sperlingxx
- Add separator-on-null parameter to strings concatenate APIs (#8282) @davidwendt
- COLLECT_LIST support returning empty output columns. (#8279) @mythrocks
- Update io util to convert path like object to string (#8275) @ayushdg
- Fix result column types for empty inputs to rolling window (#8274) @mythrocks
- Actually test equality in assert_groupby_results_equal (#8272) @shwina
- CMake always explicitly specify a source files extension (#8270) @robertmaynard
- Fix struct binary search and struct flattening (#8268) @ttnghia
- Revert "patch thrust to fix int_max num elements limitation in scan_by_key" (#8263) @cwharris
- upgrade dlpack to 0.5 (#8262) @cwharris
- Fixes CSV-reader type inference for thousands separator and decimal point (#8261) @elstehle
- Fix incorrect assertion in Java concat (#8258) @sperlingxx
- Copy nested types upon construction (#8244) @isVoid
- Preserve column hierarchy when getting NULL row from `LIST` column (#8206) @isVoid
- Clip decimal binary op precision at max precision (#8194) @ChrisJar
📖 Documentation
- Add docstring for `dask_cudf.read_csv` (#8355) @galipremsagar
- Fix cudf release version in readme (#8331) @galipremsagar
- Fix structs column description in dev docs (#8318) @isVoid
- Update readme with correct CUDA versions (#8315) @raydouglass
- Add description of the cuIO GDS integration (#8293) @vuule
- Remove unused parameter from copy_partition kernel documentation (#8283) @robertmaynard
🚀 New Features
- Add support merging b/w categorical data (#8332) @galipremsagar
- Java: Support struct scalar (#8327) @sperlingxx
- added ishomogeneous property (#8299) @shaneding
- Added decimal writing for CSV writer (#8296) @kaatish
- Java: Support creating a scalar from utf8 string (#8294) @firestarman
- Add Java API for Concatenate strings with separator (#8289) @tgravescs
- `strings::join_list_elements` options for empty list inputs (#8285) @ttnghia
- Return python lists for getitem calls to list type series (#8265) @brandon-b-miller
- add unit tests for lead/lag on list for row window (#8259) @wbo4958
- Create a String column from UTF8 String byte arrays (#8257) @firestarman
- Support scattering `list_scalar` (#8256) @isVoid
- Implement `lists::concatenate_list_elements` (#8231) @ttnghia
- Support for struct scalars. (#8220) @nvdbaranec
- Add support for decimal types in ORC writer (#8198) @vuule
- Support create lists column from a `list_scalar` (#8185) @isVoid
- `Groupby.shift` c++ API refactor and python binding (#8131) @isVoid
- Add `groupby::replace_nulls(replace_policy)` api (#7118) @isVoid
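To make the `groupby::replace_nulls(replace_policy)` entry above more concrete, here is a small pure-Python model of per-group null replacement. It is only a sketch: the function name `group_replace_nulls` is hypothetical, it keeps rows in their original order rather than reproducing libcudf's output layout, and it assumes the two policies mean "fill from the nearest earlier valid row in the same group" (`PRECEDING`) and "fill from the nearest later valid row in the same group" (`FOLLOWING`).

```python
from typing import Dict, Hashable, List, Optional, Sequence


def group_replace_nulls(
    keys: Sequence[Hashable],
    values: Sequence[Optional[int]],
    policy: str = "PRECEDING",
) -> List[Optional[int]]:
    """Fill None values from neighbouring valid rows with the same key (illustrative model)."""
    if policy not in ("PRECEDING", "FOLLOWING"):
        raise ValueError("policy must be 'PRECEDING' or 'FOLLOWING'")
    if len(keys) != len(values):
        raise ValueError("keys and values must have the same length")
    n = len(values)
    # Walk forward for PRECEDING, backward for FOLLOWING.
    order = range(n) if policy == "PRECEDING" else range(n - 1, -1, -1)
    last_valid: Dict[Hashable, int] = {}
    out: List[Optional[int]] = list(values)
    for i in order:
        if values[i] is None:
            # Stays None when the group has no valid value in that direction.
            out[i] = last_valid.get(keys[i])
        else:
            last_valid[keys[i]] = values[i]
    return out


keys = ["a", "b", "a", "b", "a"]
values = [1, None, None, 4, None]
forward = group_replace_nulls(keys, values, "PRECEDING")
backward = group_replace_nulls(keys, values, "FOLLOWING")
print(forward)
print(backward)
```

Nulls with no valid neighbour in the chosen direction stay null in this model, which is why the first `b` row is unchanged under `PRECEDING`.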
🛠️ Improvements
- Support Dask + Distributed 2021.05.1 (#8392) @jakirkham
- Add aliases for string methods (#8353) @shwina
- Update environment variable used to determine `cuda_version` (#8321) @ajschmidt8
- JNI: Refactor the code of making column from scalar (#8310) @firestarman
- Update `CHANGELOG.md` links for calver (#8303) @ajschmidt8
- Merge `branch-0.19` into `branch-21.06` (#8302) @ajschmidt8
- use address and length for GDS reads/writes (#8301) @rongou
- Update cudfjni version to 21.06.0 (#8292) @pxLi
- Update docs build script (#8284) @ajschmidt8
- Make device_buffer streams explicit and enforce move construction (#8280) @harrism
- Introduce a common parent class for NumericalColumn and DecimalColumn (#8278) @vyasr
- Do not add nulls to the hash table when nullequality::NOTEQUAL is passed to leftsemijoin and leftantijoin (#8277) @nvdbaranec
- Enable implicit casting when concatenating mixed types (#8276) @ChrisJar
- Fix CMake FindPackage rmm, pin dev envs' dlpack to v0.3 (#8271) @trxcllnt
- Update cudfjni version to 21.06 (#8267) @pxLi
- support RMM aligned resource adapter in JNI (#8266) @rongou
- Pass compiler environment variables to conda python build (#8260) @Ethyling
- Remove abc inheritance from Serializable (#8254) @vyasr
- Move more methods into SingleColumnFrame (#8253) @vyasr
- Update ORC statistics API to use C++17 standard library (#8241) @vuule
- Correct unused parameter warnings in dictionary algorithms (#8239) @robertmaynard
- Correct unused parameters in the copying algorithms (#8232) @robertmaynard
- IO statistics cleanup (#8191) @kaatish
- Refactor of rolling_window implementation. (#8158) @nvdbaranec
- Add a flag for allowing single quotes in JSON strings. (#8144) @nvdbaranec
- Column refactoring 2 (#8130) @vyasr
- support space in workspace (#7956) @jolorunyomi
- Support collect_set on rolling window (#7881) @sperlingxx
Published by GPUtester over 4 years ago
https://github.com/rapidsai/cudf - v0.19.2
🚨 Breaking Changes
- Allow hash_partition to take a seed value (#7771) @magnatelee
- Allow merging index column with data column using keyword "on" (#7736) @skirui-source
- Change JNI API to avoid loading native dependencies when creating sort order classes. (#7729) @revans2
- Replace device_vector with device_uvector in null_mask (#7715) @harrism
- Don't identify decimals as strings. (#7710) @vyasr
- Fix Java Parquet write after writer API changes (#7655) @revans2
- Convert cudf::concatenate APIs to use spans and device_uvector (#7621) @harrism
- Update missing docstring examples in python public APIs (#7546) @galipremsagar
- Remove unneeded step parameter from strings::detail::copy_slice (#7525) @davidwendt
- Rename ARROW_STATIC_LIB because it conflicts with one in FindArrow.cmake (#7518) @trxcllnt
- Match Pandas logic for comparing two objects with nulls (#7490) @brandon-b-miller
- Add struct support to parquet writer (#7461) @devavret
- Join APIs that return gathermaps (#7454) @shwina
- `fixed_point` + `cudf::binary_operation` API Changes (#7435) @codereport
- Fix BUG: Exception when PYTHONOPTIMIZE=2 (#7434) @skirui-source
- Change nvtext::load_vocabulary_file to return a unique ptr (#7424) @davidwendt
- Refactor strings column factories (#7397) @harrism
- Use CMAKE_CUDA_ARCHITECTURES (#7391) @robertmaynard
- Upgrade pandas to 1.2 (#7375) @galipremsagar
- Rename `logical_cast` to `bit_cast` and allow additional conversions (#7373) @ttnghia
- Rework libcudf CMakeLists.txt to export targets for CPM (#7107) @trxcllnt
🐛 Bug Fixes
- unsnap: busy wait a number of cycles (#8073) @vuule
- Fix returned column type when extracting from an empty list column (#8031) @jlowe
- Don't reindex a new value on setitem if the original dataframe was empty (#8026) @vyasr
- Fix a `NameError` in meta dispatch API (#7996) @galipremsagar
- Reindex in `DataFrame.__setitem__` (#7957) @galipremsagar
- jitify direct-to-cubin compilation and caching. (#7919) @cwharris
- Use dynamic cudart for nvcomp in java build (#7896) @abellina
- fix "incompatible redefinition" warnings (#7894) @cwharris
- cudf consistently specifies the cuda runtime (#7887) @robertmaynard
- disable verbose output for jitify_preprocess (#7886) @cwharris
- CMake jit_preprocess_files function only runs when needed (#7872) @robertmaynard
- Push DeviceScalar construction into cython for list.contains (#7864) @brandon-b-miller
- cudf now sets an install rpath of $ORIGIN (#7863) @robertmaynard
- Don't install Thrust examples, tests, docs, and python files (#7811) @robertmaynard
- Sort by index in groupby tests more consistently (#7802) @shwina
- Revert "Update conda recipes pinning of repo dependencies (#7743)" (#7793) @raydouglass
- Add decimal column handling in copytypemetadata (#7788) @shwina
- Add column names validation in parquet writer (#7786) @galipremsagar
- Fix Java explode outer unit tests (#7782) @jlowe
- Fix compiler warning about non-POD types passed through ellipsis (#7781) @jrhemstad
- User resource fix for replace_nulls (#7769) @magnatelee
- Fix type dispatch for columnar replace_nulls (#7768) @jlowe
- Add `ignore_order` parameter to dask-cudf concat dispatch (#7765) @galipremsagar
- Fix slicing and arrow representations of decimal columns (#7755) @vyasr
- Fixing issue with explode_outer position not nulling position entries of null rows (#7754) @hyperbolic2346
- Implement scatter for struct columns (#7752) @ttnghia
- Fix data corruption in string columns (#7746) @galipremsagar
- Fix string length in stripe dictionary building (#7744) @kaatish
- Update conda recipes pinning of repo dependencies (#7743) @mike-wendt
- Enable dask dispatch to cuDF's `is_categorical_dtype` for cuDF objects (#7740) @brandon-b-miller
- Fix dictionary size computation in ORC writer (#7737) @vuule
- Fix `cudf::cast` overflow for `decimal64` to `int32_t` or smaller in certain cases (#7733) @codereport
- Change JNI API to avoid loading native dependencies when creating sort order classes. (#7729) @revans2
- Disable column_view data accessors for unsupported types (#7725) @jrhemstad
- Materialize RangeIndex when index=True in parquet writer (#7711) @galipremsagar
- Don't identify decimals as strings. (#7710) @vyasr
- Fix return type of DataFrame.argsort (#7706) @galipremsagar
- Fix/correct cudf installed package requirements (#7688) @robertmaynard
- Fix SparkMurmurHash3_32 hash inconsistencies with Apache Spark (#7672) @jlowe
- Fix ORC reader issue with reading empty string columns (#7656) @rgsl888prabhu
- Fix Java Parquet write after writer API changes (#7655) @revans2
- Fixing empty null lists throwing explode_outer for a loop. (#7649) @hyperbolic2346
- Fix internal compiler error during JNI Docker build (#7645) @jlowe
- Fix Debug build break with device_uvectors in grouped_rolling.cu (#7633) @mythrocks
- Parquet reader: Fix issue when using skip_rows on non-nested columns containing nulls (#7627) @nvdbaranec
- Fix ORC reader for empty DataFrame/Table (#7624) @rgsl888prabhu
- Fix specifying GPU architecture in JNI build (#7612) @jlowe
- Fix ORC writer OOM issue (#7605) @vuule
- Fix 0.18 --> 0.19 automerge (#7589) @kkraus14
- Fix ORC issue with incorrect timestamp nanosecond values (#7581) @vuule
- Fix missing Dask imports (#7580) @kkraus14
- CMAKE_CUDA_ARCHITECTURES doesn't change when build-system invokes cmake (#7579) @robertmaynard
- Another fix for offsets_end() iterator in lists_column_view (#7575) @ttnghia
- Fix ORC writer output corruption with string columns (#7565) @vuule
- Fix cudf::lists::sort_lists failing for sliced column (#7564) @ttnghia
- FIX Fix Anaconda upload args (#7558) @dillon-cullinan
- Fix index mismatch issue in equality related APIs (#7555) @galipremsagar
- FIX Revert gpuci_conda_retry on conda file output locations (#7552) @dillon-cullinan
- Fix offset_end iterator for lists_column_view, which was not correctl… (#7551) @ttnghia
- Fix no such file dlpack.h error when build libcudf (#7549) @chenrui17
- Update missing docstring examples in python public APIs (#7546) @galipremsagar
- Decimal32 Build Fix (#7544) @razajafri
- FIX Retry conda output location (#7540) @dillon-cullinan
- fix missing renames of dask git branches from master to main (#7535) @kkraus14
- Remove detail from device_span (#7533) @rwlee
- Change dask and distributed branch to main (#7532) @dantegd
- Update JNI build to use CUDF_USE_ARROW_STATIC (#7526) @jlowe
- Make sure rmm::rmm CMake target is visible to cudf users (#7524) @robertmaynard
- Fix contiguous_split not properly handling output partitions > 2 GB. (#7515) @nvdbaranec
- Change jit launch to safe_launch (#7510) @devavret
- Fix comparison between Datetime/Timedelta columns and NULL scalars (#7504) @brandon-b-miller
- Fix off-by-one error in char-parallel string scalar replace (#7502) @jlowe
- Fix JNI deprecation of all, put it on the wrong version before (#7501) @revans2
- Fix Series/Dataframe Mixed Arithmetic (#7491) @brandon-b-miller
- Fix JNI build after removal of libcudf sub-libraries (#7486) @jlowe
- Correctly compile benchmarks (#7485) @robertmaynard
- Fix bool column corruption with ORC Reader (#7483) @rgsl888prabhu
- Fix __repr__ for categorical dtype (#7476) @galipremsagar
- Java cleaner synchronization (#7474) @abellina
- Fix java float/double parsing tests (#7473) @revans2
- Pass stream and user resource to make_default_constructed_scalar (#7469) @magnatelee
- Improve stability of dask_cudf.DataFrame.var and dask_cudf.DataFrame.std (#7453) @rjzamora
- Missing device_storage_dispatch change affecting cudf::gather (#7449) @codereport
- fix cuFile JNI compile errors (#7445) @rongou
- Support Series.__setitem__ with key to a new row (#7443) @isVoid
- Fix BUG: Exception when PYTHONOPTIMIZE=2 (#7434) @skirui-source
- Make inclusive scan safe for cases with leading nulls (#7432) @magnatelee
- Fix typo in list_device_view::pair_rep_end() (#7423) @mythrocks
- Fix string to double conversion and row equivalent comparison (#7410) @ttnghia
- Fix thrust failure when transferring data from device_vector to host_vector with vectors of size 1 (#7382) @ttnghia
- Fix std::exception catch-by-reference gcc9 compile error (#7380) @davidwendt
- Fix skiprows issue with ORC Reader (#7359) @rgsl888prabhu
- fix Arrow CMake file (#7358) @rongou
- Fix lists::contains() for NaN and Decimals (#7349) @mythrocks
- Handle cupy array in Dataframe.__setitem__ (#7340) @galipremsagar
- Fix invalid-device-fn error in cudf::strings::replace_re with multiple regex's (#7336) @davidwendt
- FIX Add codecov upload block to gpu script (#6860) @dillon-cullinan
📖 Documentation
- Fix join API doxygen (#7890) @shwina
- Add Resources to README. (#7697) @bdice
- Add isin examples in Docstring (#7479) @galipremsagar
- Resolving unlinked type shorthands in cudf doc (#7416) @isVoid
- Fix typo in regex.md doc page (#7363) @davidwendt
- Fix incorrect strings_column_view::chars_size documentation (#7360) @jlowe
🚀 New Features
- Enable basic reductions for decimal columns (#7776) @ChrisJar
- Enable join on decimal columns (#7764) @ChrisJar
- Allow merging index column with data column using keyword "on" (#7736) @skirui-source
- Implement DecimalColumn + Scalar and add cudf.Scalars of Decimal64Dtype (#7732) @brandon-b-miller
- Add support for unique groupby aggregation (#7726) @shwina
- Expose libcudf's label_bins function to cudf (#7724) @vyasr
- Adding support for equi-join on struct (#7720) @hyperbolic2346
- Add decimal column comparison operations (#7716) @isVoid
- Implement scan operations for decimal columns (#7707) @ChrisJar
- Enable typecasting between decimal and int (#7691) @ChrisJar
- Enable decimal support in parquet writer (#7673) @devavret
- Adds list.unique API (#7664) @isVoid
- Fix NaN handling in drop_list_duplicates (#7662) @ttnghia
- Add lists.sort_values API (#7657) @isVoid
- Add is_integer API that can check for the validity of a string-to-integer conversion (#7642) @ttnghia
- Adds explode API (#7607) @isVoid
- Adds list.take, python binding for cudf::lists::segmented_gather (#7591) @isVoid
- Implement cudf::label_bins() (#7554) @vyasr
- Add Python bindings for lists::contains (#7547) @skirui-source
- cudf::row_bit_count() support. (#7534) @nvdbaranec
- Implement drop_list_duplicates (#7528) @ttnghia
- Add Python bindings for lists::extract_lists_element (#7505) @skirui-source
- Add explode_outer and explode_outer_position (#7499) @hyperbolic2346
- Match Pandas logic for comparing two objects with nulls (#7490) @brandon-b-miller
- Add struct support to parquet writer (#7461) @devavret
- Enable type conversion from float to decimal type (#7450) @ChrisJar
- Add cython for converting strings/fixed-point functions (#7429) @davidwendt
- Add struct column support to cudf::sort and cudf::sorted_order (#7422) @karthikeyann
- Implement groupby collect_set (#7420) @ttnghia
- Merge branch-0.18 into branch-0.19 (#7411) @raydouglass
- Refactor strings column factories (#7397) @harrism
- Add groupby scan operations (sort groupby) (#7387) @karthikeyann
- Add cudf::explode_position (#7376) @hyperbolic2346
- Add string conversion to/from decimal values libcudf APIs (#7364) @davidwendt
- Add groupby SUM_OF_SQUARES support (#7362) @karthikeyann
- Add Series.drop api (#7304) @isVoid
- get_json_object() implementation (#7286) @nvdbaranec
- Python API for LIstMethods.len() (#7283) @isVoid
- Support null_policy::EXCLUDE for COLLECT rolling aggregation (#7264) @mythrocks
- Add support for special tokens in nvtext::subword_tokenizer (#7254) @davidwendt
- Fix inplace update of data and add Series.update (#7201) @galipremsagar
- Implement cudf::group_by (hash) for decimal32 and decimal64 (#7190) @codereport
- Adding support to specify "level" parameter for Dataframe.rename (#7135) @skirui-source
🛠️ Improvements
- fix GDS include path for version 0.95 (#7877) @rongou
- Update dask + distributed to 2021.4.0 (#7858) @jakirkham
- Add ability to extract include dirs from CUDF_HOME (#7848) @galipremsagar
- Add USE_GDS as an option in build script (#7833) @pxLi
- add an allocate method with stream in java DeviceMemoryBuffer (#7826) @rongou
- Constrain dask and distributed versions to 2021.3.1 (#7825) @shwina
- Revert dask versioning of concat dispatch (#7823) @galipremsagar
- add copy methods in Java memory buffer (#7791) @rongou
- Update README and CONTRIBUTING for 0.19 (#7778) @robertmaynard
- Allow hash_partition to take a seed value (#7771) @magnatelee
- Turn on NVTX by default in java build (#7761) @tgravescs
- Add Java bindings to join gather map APIs (#7751) @jlowe
- Add replacements column support for Java replaceNulls (#7750) @jlowe
- Add Java bindings for row_bit_count (#7749) @jlowe
- Remove unused JVM array creation (#7748) @jlowe
- Added JNI support for new is_integer (#7739) @revans2
- Create and promote library aliases in libcudf installations (#7734) @trxcllnt
- Support groupby operations for decimal dtypes (#7731) @vyasr
- Memory map the input file only when GDS compatibility mode is not used (#7717) @vuule
- Replace device_vector with device_uvector in null_mask (#7715) @harrism
- Struct hashing support for SerialMurmur3 and SparkMurmur3 (#7714) @jlowe
- Add gbenchmark for nvtext replace-tokens function (#7708) @davidwendt
- Use stream in groupby calls (#7705) @karthikeyann
- Update codeowners file (#7701) @ajschmidt8
- Cleanup groupby to use host_span, device_span, device_uvector (#7698) @karthikeyann
- Add gbenchmark for nvtext ngrams functions (#7693) @davidwendt
- Misc Python/Cython optimizations (#7686) @shwina
- Add gbenchmark for nvtext tokenize functions (#7684) @davidwendt
- Add column_device_view to orc writer (#7676) @kaatish
- cudf_kafka now uses cuDF CMake export targets (CPM) (#7674) @robertmaynard
- Add gbenchmark for nvtext normalize functions (#7668) @davidwendt
- Resolve unnecessary import of thrust/optional.hpp in types.hpp (#7667) @vyasr
- Feature/optimize accessor copy (#7660) @vyasr
- Fix find_package(cudf) (#7658) @trxcllnt
- Work-around for gcc7 compile error on Centos7 (#7652) @davidwendt
- Add in JNI support for count_elements (#7651) @revans2
- Fix issues with building cudf in a non-conda environment (#7647) @galipremsagar
- Refactor ConfigureCUDA to not conditionally insert compiler flags (#7643) @robertmaynard
- Add gbenchmark for converting strings to/from timestamps (#7641) @davidwendt
- Handle constructing a cudf.Scalar from a cudf.Scalar (#7639) @shwina
- Add in JNI support for table partition (#7637) @revans2
- Add explicit fixed_point merge test (#7635) @codereport
- Add JNI support for IDENTITY hash partitioning (#7626) @revans2
- Java support on explode_outer (#7625) @sperlingxx
- Java support of casting string from/to decimal (#7623) @sperlingxx
- Convert cudf::concatenate APIs to use spans and device_uvector (#7621) @harrism
- Add gbenchmark for cudf::strings::translate function (#7617) @davidwendt
- Use file(COPY ) over file(INSTALL ) so cmake output is reduced (#7616) @robertmaynard
- Use rmm::device_uvector in place of rmm::device_vector for ORC reader/writer and cudf::io::column_buffer (#7614) @vuule
- Refactor Java host-side buffer concatenation to expose separate steps (#7610) @jlowe
- Add gbenchmarks for string substrings functions (#7603) @davidwendt
- Refactor string conversion check (#7599) @ttnghia
- JNI: Pass names of children struct columns to native Arrow IPC writer (#7598) @firestarman
- Revert "ENH Fix stale GHA and prevent duplicates " (#7595) @mike-wendt
- ENH Fix stale GHA and prevent duplicates (#7594) @mike-wendt
- Fix auto-detecting GPU architectures (#7593) @trxcllnt
- Reduce cudf library size (#7583) @robertmaynard
- Optimize cudf::make_strings_column for long strings (#7576) @davidwendt
- Always build and export the cudf::cudftestutil target (#7574) @trxcllnt
- Eliminate literal parameters to uvector::set_element_async and device_scalar::set_value (#7563) @harrism
- Add gbenchmark for strings::concatenate (#7560) @davidwendt
- Update Changelog Link (#7550) @ajschmidt8
- Add gbenchmarks for strings replace regex functions (#7541) @davidwendt
- Add __repr__ for Column and ColumnAccessor (#7531) @shwina
- Support Decimal DIV changes in cudf (#7527) @razajafri
- Remove unneeded step parameter from strings::detail::copy_slice (#7525) @davidwendt
- Use device_uvector, device_span in sort groupby (#7523) @karthikeyann
- Add gbenchmarks for strings extract function (#7522) @davidwendt
- Rename ARROW_STATIC_LIB because it conflicts with one in FindArrow.cmake (#7518) @trxcllnt
- Reduce compile time/size for scan.cu (#7516) @davidwendt
- Change device_vector to device_uvector in nvtext source files (#7512) @davidwendt
- Removed unneeded includes from traits.hpp (#7509) @davidwendt
- FIX Remove random build directory generation for ccache (#7508) @dillon-cullinan
- xfail failing pytest in pandas 1.2.3 (#7507) @galipremsagar
- JNI bit cast (#7493) @revans2
- Combine rolling window function tests (#7480) @mythrocks
- Prepare Changelog for Automation (#7477) @ajschmidt8
- Java support for explode position (#7471) @sperlingxx
- Update 0.18 changelog entry (#7463) @ajschmidt8
- JNI: Support skipping nulls for collect aggregation (#7457) @firestarman
- Join APIs that return gathermaps (#7454) @shwina
- Remove dependence on managed memory for multimap test (#7451) @jrhemstad
- Use cuFile for Parquet IO when available (#7444) @vuule
- Statistics cleanup (#7439) @kaatish
- Add gbenchmarks for strings filter functions (#7438) @davidwendt
- fixed_point + cudf::binary_operation API Changes (#7435) @codereport
- Improve string gather performance (#7433) @jlowe
- Don't use user resource for a temporary allocation in sort_by_key (#7431) @magnatelee
- Detail APIs for datetime functions (#7430) @magnatelee
- Replace thrust::max_element with thrust::reduce in strings findall_re (#7428) @davidwendt
- Add gbenchmark for strings split/split_record functions (#7427) @davidwendt
- Update JNI build to use CMAKE_CUDA_ARCHITECTURES (#7425) @jlowe
- Change nvtext::load_vocabulary_file to return a unique ptr (#7424) @davidwendt
- Simplify type dispatch with device_storage_dispatch (#7419) @codereport
- Java support for casting of nested child columns (#7417) @razajafri
- Improve scalar string replace performance for long strings (#7415) @jlowe
- Remove unneeded temporary device vector for strings scatter specialization (#7409) @davidwendt
- bitmask_or implementation with bitmask refactor (#7406) @rwlee
- Add other cudf::strings::replace functions to current strings replace gbenchmark (#7403) @davidwendt
- Clean up included headers in device_operators.cuh (#7401) @codereport
- Move nullable index iterator to indexalator factory (#7399) @davidwendt
- ENH Pass ccache variables to conda recipe & use Ninja in CI (#7398) @Ethyling
- upgrade maven-antrun-plugin to support maven parallel builds (#7393) @rongou
- Add gbenchmark for strings find/contains functions (#7392) @davidwendt
- Use CMAKE_CUDA_ARCHITECTURES (#7391) @robertmaynard
- Refactor libcudf strings::replace to use make_strings_children utility (#7384) @davidwendt
- Added in JNI support for out of core sort algorithm (#7381) @revans2
- Upgrade pandas to 1.2 (#7375) @galipremsagar
- Rename logical_cast to bit_cast and allow additional conversions (#7373) @ttnghia
- jitify 2 support (#7372) @cwharris
- compile_udf: Cache PTX for similar functions (#7371) @gmarkall
- Add string scalar replace benchmark (#7369) @jlowe
- Add gbenchmark for strings contains_re/count_re functions (#7366) @davidwendt
- Update orc reader and writer fuzz tests (#7357) @galipremsagar
- Improve url_decode performance for long strings (#7353) @jlowe
- cudf::ast Small Refactorings (#7352) @codereport
- Remove std::cout and print in the scatter test function EmptyListsOfNullableStrings. (#7342) @ttnghia
- Use cudf::detail::make_counting_transform_iterator (#7338) @codereport
- Change block size parameter from a global to a template param. (#7333) @nvdbaranec
- Partial clean up of ORC writer (#7324) @vuule
- Add gbenchmark for cudf::strings::to_lower (#7316) @davidwendt
- Update Java bindings version to 0.19-SNAPSHOT (#7307) @pxLi
- Move cudf::test::make_counting_transform_iterator to cudf/detail/iterator.cuh (#7306) @codereport
- Use string literals in fixed_point release_asserts (#7303) @codereport
- Fix merge conflicts for #7295 (#7297) @ajschmidt8
- Add UTF-8 chars to create_random_column<string_view> benchmark utility (#7292) @davidwendt
- Abstracting block reduce and block scan from cuIO kernels with cub apis (#7278) @rgsl888prabhu
- Build.sh use cmake --build to drive build system invocation (#7270) @robertmaynard
- Refactor dictionary support for reductions any/all (#7242) @davidwendt
- Replace stream.value() with stream for stream_view args (#7236) @karthikeyann
- Interval index and interval_range (#7182) @marlenezw
- avro reader integration tests (#7156) @cwharris
- Rework libcudf CMakeLists.txt to export targets for CPM (#7107) @trxcllnt
- Adding Interval Dtype (#6984) @marlenezw
- Cleaning up for loops with make_(counting_)transform_iterator (#6546) @codereport
- C++
Published by GPUtester almost 5 years ago
https://github.com/rapidsai/cudf - v0.19.1
🚨 Breaking Changes
- Allow hash_partition to take a seed value (#7771) @magnatelee
- Allow merging index column with data column using keyword "on" (#7736) @skirui-source
- Change JNI API to avoid loading native dependencies when creating sort order classes. (#7729) @revans2
- Replace device_vector with device_uvector in null_mask (#7715) @harrism
- Don't identify decimals as strings. (#7710) @vyasr
- Fix Java Parquet write after writer API changes (#7655) @revans2
- Convert cudf::concatenate APIs to use spans and device_uvector (#7621) @harrism
- Update missing docstring examples in python public APIs (#7546) @galipremsagar
- Remove unneeded step parameter from strings::detail::copy_slice (#7525) @davidwendt
- Rename ARROW_STATIC_LIB because it conflicts with one in FindArrow.cmake (#7518) @trxcllnt
- Match Pandas logic for comparing two objects with nulls (#7490) @brandon-b-miller
- Add struct support to parquet writer (#7461) @devavret
- Join APIs that return gathermaps (#7454) @shwina
- fixed_point + cudf::binary_operation API Changes (#7435) @codereport
- Fix BUG: Exception when PYTHONOPTIMIZE=2 (#7434) @skirui-source
- Change nvtext::load_vocabulary_file to return a unique ptr (#7424) @davidwendt
- Refactor strings column factories (#7397) @harrism
- Use CMAKE_CUDA_ARCHITECTURES (#7391) @robertmaynard
- Upgrade pandas to 1.2 (#7375) @galipremsagar
- Rename logical_cast to bit_cast and allow additional conversions (#7373) @ttnghia
- Rework libcudf CMakeLists.txt to export targets for CPM (#7107) @trxcllnt
🐛 Bug Fixes
- Fix returned column type when extracting from an empty list column (#8031) @jlowe
- Don't reindex a new value on setitem if the original dataframe was empty (#8026) @vyasr
- Fix a NameError in meta dispatch API (#7996) @galipremsagar
- Reindex in DataFrame.__setitem__ (#7957) @galipremsagar
- jitify direct-to-cubin compilation and caching. (#7919) @cwharris
- Use dynamic cudart for nvcomp in java build (#7896) @abellina
- fix "incompatible redefinition" warnings (#7894) @cwharris
- cudf consistently specifies the cuda runtime (#7887) @robertmaynard
- disable verbose output for jitify_preprocess (#7886) @cwharris
- CMake jit_preprocess_files function only runs when needed (#7872) @robertmaynard
- Push DeviceScalar construction into cython for list.contains (#7864) @brandon-b-miller
- cudf now sets an install rpath of $ORIGIN (#7863) @robertmaynard
- Don't install Thrust examples, tests, docs, and python files (#7811) @robertmaynard
- Sort by index in groupby tests more consistently (#7802) @shwina
- Revert "Update conda recipes pinning of repo dependencies (#7743)" (#7793) @raydouglass
- Add decimal column handling in copy_type_metadata (#7788) @shwina
- Add column names validation in parquet writer (#7786) @galipremsagar
- Fix Java explode outer unit tests (#7782) @jlowe
- Fix compiler warning about non-POD types passed through ellipsis (#7781) @jrhemstad
- User resource fix for replace_nulls (#7769) @magnatelee
- Fix type dispatch for columnar replace_nulls (#7768) @jlowe
- Add ignore_order parameter to dask-cudf concat dispatch (#7765) @galipremsagar
- Fix slicing and arrow representations of decimal columns (#7755) @vyasr
- Fixing issue with explode_outer position not nulling position entries of null rows (#7754) @hyperbolic2346
- Implement scatter for struct columns (#7752) @ttnghia
- Fix data corruption in string columns (#7746) @galipremsagar
- Fix string length in stripe dictionary building (#7744) @kaatish
- Update conda recipes pinning of repo dependencies (#7743) @mike-wendt
- Enable dask dispatch to cuDF's is_categorical_dtype for cuDF objects (#7740) @brandon-b-miller
- Fix dictionary size computation in ORC writer (#7737) @vuule
- Fix cudf::cast overflow for decimal64 to int32_t or smaller in certain cases (#7733) @codereport
- Change JNI API to avoid loading native dependencies when creating sort order classes. (#7729) @revans2
- Disable column_view data accessors for unsupported types (#7725) @jrhemstad
- Materialize RangeIndex when index=True in parquet writer (#7711) @galipremsagar
- Don't identify decimals as strings. (#7710) @vyasr
- Fix return type of DataFrame.argsort (#7706) @galipremsagar
- Fix/correct cudf installed package requirements (#7688) @robertmaynard
- Fix SparkMurmurHash3_32 hash inconsistencies with Apache Spark (#7672) @jlowe
- Fix ORC reader issue with reading empty string columns (#7656) @rgsl888prabhu
- Fix Java Parquet write after writer API changes (#7655) @revans2
- Fixing empty null lists throwing explode_outer for a loop. (#7649) @hyperbolic2346
- Fix internal compiler error during JNI Docker build (#7645) @jlowe
- Fix Debug build break with device_uvectors in grouped_rolling.cu (#7633) @mythrocks
- Parquet reader: Fix issue when using skip_rows on non-nested columns containing nulls (#7627) @nvdbaranec
- Fix ORC reader for empty DataFrame/Table (#7624) @rgsl888prabhu
- Fix specifying GPU architecture in JNI build (#7612) @jlowe
- Fix ORC writer OOM issue (#7605) @vuule
- Fix 0.18 --> 0.19 automerge (#7589) @kkraus14
- Fix ORC issue with incorrect timestamp nanosecond values (#7581) @vuule
- Fix missing Dask imports (#7580) @kkraus14
- CMAKE_CUDA_ARCHITECTURES doesn't change when build-system invokes cmake (#7579) @robertmaynard
- Another fix for offsets_end() iterator in lists_column_view (#7575) @ttnghia
- Fix ORC writer output corruption with string columns (#7565) @vuule
- Fix cudf::lists::sort_lists failing for sliced column (#7564) @ttnghia
- FIX Fix Anaconda upload args (#7558) @dillon-cullinan
- Fix index mismatch issue in equality related APIs (#7555) @galipremsagar
- FIX Revert gpuci_conda_retry on conda file output locations (#7552) @dillon-cullinan
- Fix offset_end iterator for lists_column_view, which was not correctl… (#7551) @ttnghia
- Fix no such file dlpack.h error when build libcudf (#7549) @chenrui17
- Update missing docstring examples in python public APIs (#7546) @galipremsagar
- Decimal32 Build Fix (#7544) @razajafri
- FIX Retry conda output location (#7540) @dillon-cullinan
- fix missing renames of dask git branches from master to main (#7535) @kkraus14
- Remove detail from device_span (#7533) @rwlee
- Change dask and distributed branch to main (#7532) @dantegd
- Update JNI build to use CUDF_USE_ARROW_STATIC (#7526) @jlowe
- Make sure rmm::rmm CMake target is visible to cudf users (#7524) @robertmaynard
- Fix contiguous_split not properly handling output partitions > 2 GB. (#7515) @nvdbaranec
- Change jit launch to safe_launch (#7510) @devavret
- Fix comparison between Datetime/Timedelta columns and NULL scalars (#7504) @brandon-b-miller
- Fix off-by-one error in char-parallel string scalar replace (#7502) @jlowe
- Fix JNI deprecation of all, put it on the wrong version before (#7501) @revans2
- Fix Series/Dataframe Mixed Arithmetic (#7491) @brandon-b-miller
- Fix JNI build after removal of libcudf sub-libraries (#7486) @jlowe
- Correctly compile benchmarks (#7485) @robertmaynard
- Fix bool column corruption with ORC Reader (#7483) @rgsl888prabhu
- Fix __repr__ for categorical dtype (#7476) @galipremsagar
- Java cleaner synchronization (#7474) @abellina
- Fix java float/double parsing tests (#7473) @revans2
- Pass stream and user resource to make_default_constructed_scalar (#7469) @magnatelee
- Improve stability of dask_cudf.DataFrame.var and dask_cudf.DataFrame.std (#7453) @rjzamora
- Missing device_storage_dispatch change affecting cudf::gather (#7449) @codereport
- fix cuFile JNI compile errors (#7445) @rongou
- Support Series.__setitem__ with key to a new row (#7443) @isVoid
- Fix BUG: Exception when PYTHONOPTIMIZE=2 (#7434) @skirui-source
- Make inclusive scan safe for cases with leading nulls (#7432) @magnatelee
- Fix typo in list_device_view::pair_rep_end() (#7423) @mythrocks
- Fix string to double conversion and row equivalent comparison (#7410) @ttnghia
- Fix thrust failure when transferring data from device_vector to host_vector with vectors of size 1 (#7382) @ttnghia
- Fix std::exception catch-by-reference gcc9 compile error (#7380) @davidwendt
- Fix skiprows issue with ORC Reader (#7359) @rgsl888prabhu
- fix Arrow CMake file (#7358) @rongou
- Fix lists::contains() for NaN and Decimals (#7349) @mythrocks
- Handle cupy array in Dataframe.__setitem__ (#7340) @galipremsagar
- Fix invalid-device-fn error in cudf::strings::replace_re with multiple regex's (#7336) @davidwendt
- FIX Add codecov upload block to gpu script (#6860) @dillon-cullinan
📖 Documentation
- Fix join API doxygen (#7890) @shwina
- Add Resources to README. (#7697) @bdice
- Add isin examples in Docstring (#7479) @galipremsagar
- Resolving unlinked type shorthands in cudf doc (#7416) @isVoid
- Fix typo in regex.md doc page (#7363) @davidwendt
- Fix incorrect strings_column_view::chars_size documentation (#7360) @jlowe
🚀 New Features
- Enable basic reductions for decimal columns (#7776) @ChrisJar
- Enable join on decimal columns (#7764) @ChrisJar
- Allow merging index column with data column using keyword "on" (#7736) @skirui-source
- Implement DecimalColumn + Scalar and add cudf.Scalars of Decimal64Dtype (#7732) @brandon-b-miller
- Add support for unique groupby aggregation (#7726) @shwina
- Expose libcudf's label_bins function to cudf (#7724) @vyasr
- Adding support for equi-join on struct (#7720) @hyperbolic2346
- Add decimal column comparison operations (#7716) @isVoid
- Implement scan operations for decimal columns (#7707) @ChrisJar
- Enable typecasting between decimal and int (#7691) @ChrisJar
- Enable decimal support in parquet writer (#7673) @devavret
- Adds list.unique API (#7664) @isVoid
- Fix NaN handling in drop_list_duplicates (#7662) @ttnghia
- Add lists.sort_values API (#7657) @isVoid
- Add is_integer API that can check for the validity of a string-to-integer conversion (#7642) @ttnghia
- Adds explode API (#7607) @isVoid
- Adds list.take, python binding for cudf::lists::segmented_gather (#7591) @isVoid
- Implement cudf::label_bins() (#7554) @vyasr
- Add Python bindings for lists::contains (#7547) @skirui-source
- cudf::row_bit_count() support. (#7534) @nvdbaranec
- Implement drop_list_duplicates (#7528) @ttnghia
- Add Python bindings for lists::extract_lists_element (#7505) @skirui-source
- Add explode_outer and explode_outer_position (#7499) @hyperbolic2346
- Match Pandas logic for comparing two objects with nulls (#7490) @brandon-b-miller
- Add struct support to parquet writer (#7461) @devavret
- Enable type conversion from float to decimal type (#7450) @ChrisJar
- Add cython for converting strings/fixed-point functions (#7429) @davidwendt
- Add struct column support to cudf::sort and cudf::sorted_order (#7422) @karthikeyann
- Implement groupby collect_set (#7420) @ttnghia
- Merge branch-0.18 into branch-0.19 (#7411) @raydouglass
- Refactor strings column factories (#7397) @harrism
- Add groupby scan operations (sort groupby) (#7387) @karthikeyann
- Add cudf::explode_position (#7376) @hyperbolic2346
- Add string conversion to/from decimal values libcudf APIs (#7364) @davidwendt
- Add groupby SUM_OF_SQUARES support (#7362) @karthikeyann
- Add Series.drop api (#7304) @isVoid
- get_json_object() implementation (#7286) @nvdbaranec
- Python API for LIstMethods.len() (#7283) @isVoid
- Support null_policy::EXCLUDE for COLLECT rolling aggregation (#7264) @mythrocks
- Add support for special tokens in nvtext::subword_tokenizer (#7254) @davidwendt
- Fix inplace update of data and add Series.update (#7201) @galipremsagar
- Implement cudf::group_by (hash) for decimal32 and decimal64 (#7190) @codereport
- Adding support to specify "level" parameter for Dataframe.rename (#7135) @skirui-source
🛠️ Improvements
- fix GDS include path for version 0.95 (#7877) @rongou
- Update dask + distributed to 2021.4.0 (#7858) @jakirkham
- Add ability to extract include dirs from CUDF_HOME (#7848) @galipremsagar
- Add USE_GDS as an option in build script (#7833) @pxLi
- add an allocate method with stream in java DeviceMemoryBuffer (#7826) @rongou
- Constrain dask and distributed versions to 2021.3.1 (#7825) @shwina
- Revert dask versioning of concat dispatch (#7823) @galipremsagar
- add copy methods in Java memory buffer (#7791) @rongou
- Update README and CONTRIBUTING for 0.19 (#7778) @robertmaynard
- Allow hash_partition to take a seed value (#7771) @magnatelee
- Turn on NVTX by default in java build (#7761) @tgravescs
- Add Java bindings to join gather map APIs (#7751) @jlowe
- Add replacements column support for Java replaceNulls (#7750) @jlowe
- Add Java bindings for row_bit_count (#7749) @jlowe
- Remove unused JVM array creation (#7748) @jlowe
- Added JNI support for new is_integer (#7739) @revans2
- Create and promote library aliases in libcudf installations (#7734) @trxcllnt
- Support groupby operations for decimal dtypes (#7731) @vyasr
- Memory map the input file only when GDS compatibility mode is not used (#7717) @vuule
- Replace device_vector with device_uvector in null_mask (#7715) @harrism
- Struct hashing support for SerialMurmur3 and SparkMurmur3 (#7714) @jlowe
- Add gbenchmark for nvtext replace-tokens function (#7708) @davidwendt
- Use stream in groupby calls (#7705) @karthikeyann
- Update codeowners file (#7701) @ajschmidt8
- Cleanup groupby to use host_span, device_span, device_uvector (#7698) @karthikeyann
- Add gbenchmark for nvtext ngrams functions (#7693) @davidwendt
- Misc Python/Cython optimizations (#7686) @shwina
- Add gbenchmark for nvtext tokenize functions (#7684) @davidwendt
- Add column_device_view to orc writer (#7676) @kaatish
- cudf_kafka now uses cuDF CMake export targets (CPM) (#7674) @robertmaynard
- Add gbenchmark for nvtext normalize functions (#7668) @davidwendt
- Resolve unnecessary import of thrust/optional.hpp in types.hpp (#7667) @vyasr
- Feature/optimize accessor copy (#7660) @vyasr
- Fix find_package(cudf) (#7658) @trxcllnt
- Work-around for gcc7 compile error on Centos7 (#7652) @davidwendt
- Add in JNI support for count_elements (#7651) @revans2
- Fix issues with building cudf in a non-conda environment (#7647) @galipremsagar
- Refactor ConfigureCUDA to not conditionally insert compiler flags (#7643) @robertmaynard
- Add gbenchmark for converting strings to/from timestamps (#7641) @davidwendt
- Handle constructing a cudf.Scalar from a cudf.Scalar (#7639) @shwina
- Add in JNI support for table partition (#7637) @revans2
- Add explicit fixed_point merge test (#7635) @codereport
- Add JNI support for IDENTITY hash partitioning (#7626) @revans2
- Java support on explode_outer (#7625) @sperlingxx
- Java support of casting string from/to decimal (#7623) @sperlingxx
- Convert cudf::concatenate APIs to use spans and device_uvector (#7621) @harrism
- Add gbenchmark for cudf::strings::translate function (#7617) @davidwendt
- Use file(COPY ) over file(INSTALL ) so cmake output is reduced (#7616) @robertmaynard
- Use rmm::device_uvector in place of rmm::device_vector for ORC reader/writer and cudf::io::column_buffer (#7614) @vuule
- Refactor Java host-side buffer concatenation to expose separate steps (#7610) @jlowe
- Add gbenchmarks for string substrings functions (#7603) @davidwendt
- Refactor string conversion check (#7599) @ttnghia
- JNI: Pass names of children struct columns to native Arrow IPC writer (#7598) @firestarman
- Revert "ENH Fix stale GHA and prevent duplicates " (#7595) @mike-wendt
- ENH Fix stale GHA and prevent duplicates (#7594) @mike-wendt
- Fix auto-detecting GPU architectures (#7593) @trxcllnt
- Reduce cudf library size (#7583) @robertmaynard
- Optimize cudf::make_strings_column for long strings (#7576) @davidwendt
- Always build and export the cudf::cudftestutil target (#7574) @trxcllnt
- Eliminate literal parameters to uvector::set_element_async and device_scalar::set_value (#7563) @harrism
- Add gbenchmark for strings::concatenate (#7560) @davidwendt
- Update Changelog Link (#7550) @ajschmidt8
- Add gbenchmarks for strings replace regex functions (#7541) @davidwendt
- Add `__repr__` for Column and ColumnAccessor (#7531) @shwina
- Support Decimal DIV changes in cudf (#7527) @razajafri
- Remove unneeded step parameter from strings::detail::copy_slice (#7525) @davidwendt
- Use device_uvector, device_span in sort groupby (#7523) @karthikeyann
- Add gbenchmarks for strings extract function (#7522) @davidwendt
- Rename ARROW_STATIC_LIB because it conflicts with one in FindArrow.cmake (#7518) @trxcllnt
- Reduce compile time/size for scan.cu (#7516) @davidwendt
- Change device_vector to device_uvector in nvtext source files (#7512) @davidwendt
- Removed unneeded includes from traits.hpp (#7509) @davidwendt
- FIX Remove random build directory generation for ccache (#7508) @dillon-cullinan
- xfail failing pytest in pandas 1.2.3 (#7507) @galipremsagar
- JNI bit cast (#7493) @revans2
- Combine rolling window function tests (#7480) @mythrocks
- Prepare Changelog for Automation (#7477) @ajschmidt8
- Java support for explode position (#7471) @sperlingxx
- Update 0.18 changelog entry (#7463) @ajschmidt8
- JNI: Support skipping nulls for collect aggregation (#7457) @firestarman
- Join APIs that return gathermaps (#7454) @shwina
- Remove dependence on managed memory for multimap test (#7451) @jrhemstad
- Use cuFile for Parquet IO when available (#7444) @vuule
- Statistics cleanup (#7439) @kaatish
- Add gbenchmarks for strings filter functions (#7438) @davidwendt
- `fixed_point` + `cudf::binary_operation` API Changes (#7435) @codereport
- Improve string gather performance (#7433) @jlowe
- Don't use user resource for a temporary allocation in sort_by_key (#7431) @magnatelee
- Detail APIs for datetime functions (#7430) @magnatelee
- Replace thrust::max_element with thrust::reduce in strings findall_re (#7428) @davidwendt
- Add gbenchmark for strings split/split_record functions (#7427) @davidwendt
- Update JNI build to use CMAKE_CUDA_ARCHITECTURES (#7425) @jlowe
- Change nvtext::load_vocabulary_file to return a unique ptr (#7424) @davidwendt
- Simplify type dispatch with `device_storage_dispatch` (#7419) @codereport
- Java support for casting of nested child columns (#7417) @razajafri
- Improve scalar string replace performance for long strings (#7415) @jlowe
- Remove unneeded temporary device vector for strings scatter specialization (#7409) @davidwendt
- bitmask_or implementation with bitmask refactor (#7406) @rwlee
- Add other cudf::strings::replace functions to current strings replace gbenchmark (#7403) @davidwendt
- Clean up included headers in `device_operators.cuh` (#7401) @codereport
- Move nullable index iterator to indexalator factory (#7399) @davidwendt
- ENH Pass ccache variables to conda recipe & use Ninja in CI (#7398) @Ethyling
- upgrade maven-antrun-plugin to support maven parallel builds (#7393) @rongou
- Add gbenchmark for strings find/contains functions (#7392) @davidwendt
- Use CMAKE_CUDA_ARCHITECTURES (#7391) @robertmaynard
- Refactor libcudf strings::replace to use make_strings_children utility (#7384) @davidwendt
- Added in JNI support for out of core sort algorithm (#7381) @revans2
- Upgrade pandas to 1.2 (#7375) @galipremsagar
- Rename `logical_cast` to `bit_cast` and allow additional conversions (#7373) @ttnghia
- jitify 2 support (#7372) @cwharris
- compile_udf: Cache PTX for similar functions (#7371) @gmarkall
- Add string scalar replace benchmark (#7369) @jlowe
- Add gbenchmark for strings contains_re/count_re functions (#7366) @davidwendt
- Update orc reader and writer fuzz tests (#7357) @galipremsagar
- Improve url_decode performance for long strings (#7353) @jlowe
- `cudf::ast` Small Refactorings (#7352) @codereport
- Remove std::cout and print in the scatter test function EmptyListsOfNullableStrings. (#7342) @ttnghia
- Use `cudf::detail::make_counting_transform_iterator` (#7338) @codereport
- Change block size parameter from a global to a template param. (#7333) @nvdbaranec
- Partial clean up of ORC writer (#7324) @vuule
- Add gbenchmark for cudf::strings::to_lower (#7316) @davidwendt
- Update Java bindings version to 0.19-SNAPSHOT (#7307) @pxLi
- Move `cudf::test::make_counting_transform_iterator` to `cudf/detail/iterator.cuh` (#7306) @codereport
- Use string literals in `fixed_point` `release_assert`s (#7303) @codereport
- Fix merge conflicts for #7295 (#7297) @ajschmidt8
- Add UTF-8 chars to create_random_column<string_view> benchmark utility (#7292) @davidwendt
- Abstracting block reduce and block scan from cuIO kernels with `cub` apis (#7278) @rgsl888prabhu
- Build.sh use cmake --build to drive build system invocation (#7270) @robertmaynard
- Refactor dictionary support for reductions any/all (#7242) @davidwendt
- Replace stream.value() with stream for stream_view args (#7236) @karthikeyann
- Interval index and interval_range (#7182) @marlenezw
- avro reader integration tests (#7156) @cwharris
- Rework libcudf CMakeLists.txt to export targets for CPM (#7107) @trxcllnt
- Adding Interval Dtype (#6984) @marlenezw
- Cleaning up `for` loops with `make_(counting_)transform_iterator` (#6546) @codereport
- C++
Published by GPUtester almost 5 years ago
https://github.com/rapidsai/cudf - v0.19.0
🚨 Breaking Changes
- Allow hash_partition to take a seed value (#7771) @magnatelee
- Allow merging index column with data column using keyword "on" (#7736) @skirui-source
- Change JNI API to avoid loading native dependencies when creating sort order classes. (#7729) @revans2
- Replace device_vector with device_uvector in null_mask (#7715) @harrism
- Don't identify decimals as strings. (#7710) @vyasr
- Fix Java Parquet write after writer API changes (#7655) @revans2
- Convert cudf::concatenate APIs to use spans and device_uvector (#7621) @harrism
- Update missing docstring examples in python public APIs (#7546) @galipremsagar
- Remove unneeded step parameter from strings::detail::copy_slice (#7525) @davidwendt
- Rename ARROW_STATIC_LIB because it conflicts with one in FindArrow.cmake (#7518) @trxcllnt
- Match Pandas logic for comparing two objects with nulls (#7490) @brandon-b-miller
- Add struct support to parquet writer (#7461) @devavret
- Join APIs that return gathermaps (#7454) @shwina
- `fixed_point` + `cudf::binary_operation` API Changes (#7435) @codereport
- Fix BUG: Exception when PYTHONOPTIMIZE=2 (#7434) @skirui-source
- Change nvtext::load_vocabulary_file to return a unique ptr (#7424) @davidwendt
- Refactor strings column factories (#7397) @harrism
- Use CMAKE_CUDA_ARCHITECTURES (#7391) @robertmaynard
- Upgrade pandas to 1.2 (#7375) @galipremsagar
- Rename `logical_cast` to `bit_cast` and allow additional conversions (#7373) @ttnghia
- Rework libcudf CMakeLists.txt to export targets for CPM (#7107) @trxcllnt
🐛 Bug Fixes
- Fix a `NameError` in meta dispatch API (#7996) @galipremsagar
- Reindex in `DataFrame.__setitem__` (#7957) @galipremsagar
- jitify direct-to-cubin compilation and caching. (#7919) @cwharris
- Use dynamic cudart for nvcomp in java build (#7896) @abellina
- fix "incompatible redefinition" warnings (#7894) @cwharris
- cudf consistently specifies the cuda runtime (#7887) @robertmaynard
- disable verbose output for jitify_preprocess (#7886) @cwharris
- CMake jit_preprocess_files function only runs when needed (#7872) @robertmaynard
- Push DeviceScalar construction into cython for list.contains (#7864) @brandon-b-miller
- cudf now sets an install rpath of $ORIGIN (#7863) @robertmaynard
- Don't install Thrust examples, tests, docs, and python files (#7811) @robertmaynard
- Sort by index in groupby tests more consistently (#7802) @shwina
- Revert "Update conda recipes pinning of repo dependencies (#7743)" (#7793) @raydouglass
- Add decimal column handling in copy_type_metadata (#7788) @shwina
- Add column names validation in parquet writer (#7786) @galipremsagar
- Fix Java explode outer unit tests (#7782) @jlowe
- Fix compiler warning about non-POD types passed through ellipsis (#7781) @jrhemstad
- User resource fix for replace_nulls (#7769) @magnatelee
- Fix type dispatch for columnar replace_nulls (#7768) @jlowe
- Add `ignore_order` parameter to dask-cudf concat dispatch (#7765) @galipremsagar
- Fix slicing and arrow representations of decimal columns (#7755) @vyasr
- Fixing issue with explode_outer position not nulling position entries of null rows (#7754) @hyperbolic2346
- Implement scatter for struct columns (#7752) @ttnghia
- Fix data corruption in string columns (#7746) @galipremsagar
- Fix string length in stripe dictionary building (#7744) @kaatish
- Update conda recipes pinning of repo dependencies (#7743) @mike-wendt
- Enable dask dispatch to cuDF's `is_categorical_dtype` for cuDF objects (#7740) @brandon-b-miller
- Fix dictionary size computation in ORC writer (#7737) @vuule
- Fix `cudf::cast` overflow for `decimal64` to `int32_t` or smaller in certain cases (#7733) @codereport
- Change JNI API to avoid loading native dependencies when creating sort order classes. (#7729) @revans2
- Disable column_view data accessors for unsupported types (#7725) @jrhemstad
- Materialize `RangeIndex` when `index=True` in parquet writer (#7711) @galipremsagar
- Don't identify decimals as strings. (#7710) @vyasr
- Fix return type of `DataFrame.argsort` (#7706) @galipremsagar
- Fix/correct cudf installed package requirements (#7688) @robertmaynard
- Fix SparkMurmurHash3_32 hash inconsistencies with Apache Spark (#7672) @jlowe
- Fix ORC reader issue with reading empty string columns (#7656) @rgsl888prabhu
- Fix Java Parquet write after writer API changes (#7655) @revans2
- Fixing empty null lists throwing explode_outer for a loop. (#7649) @hyperbolic2346
- Fix internal compiler error during JNI Docker build (#7645) @jlowe
- Fix Debug build break with device_uvectors in grouped_rolling.cu (#7633) @mythrocks
- Parquet reader: Fix issue when using skip_rows on non-nested columns containing nulls (#7627) @nvdbaranec
- Fix ORC reader for empty DataFrame/Table (#7624) @rgsl888prabhu
- Fix specifying GPU architecture in JNI build (#7612) @jlowe
- Fix ORC writer OOM issue (#7605) @vuule
- Fix 0.18 --> 0.19 automerge (#7589) @kkraus14
- Fix ORC issue with incorrect timestamp nanosecond values (#7581) @vuule
- Fix missing Dask imports (#7580) @kkraus14
- CMAKE_CUDA_ARCHITECTURES doesn't change when build-system invokes cmake (#7579) @robertmaynard
- Another fix for offsets_end() iterator in lists_column_view (#7575) @ttnghia
- Fix ORC writer output corruption with string columns (#7565) @vuule
- Fix cudf::lists::sort_lists failing for sliced column (#7564) @ttnghia
- FIX Fix Anaconda upload args (#7558) @dillon-cullinan
- Fix index mismatch issue in equality related APIs (#7555) @galipremsagar
- FIX Revert gpuci_conda_retry on conda file output locations (#7552) @dillon-cullinan
- Fix offset_end iterator for lists_column_view, which was not correctl… (#7551) @ttnghia
- Fix no such file dlpack.h error when build libcudf (#7549) @chenrui17
- Update missing docstring examples in python public APIs (#7546) @galipremsagar
- Decimal32 Build Fix (#7544) @razajafri
- FIX Retry conda output location (#7540) @dillon-cullinan
- fix missing renames of dask git branches from master to main (#7535) @kkraus14
- Remove detail from device_span (#7533) @rwlee
- Change dask and distributed branch to main (#7532) @dantegd
- Update JNI build to use CUDF_USE_ARROW_STATIC (#7526) @jlowe
- Make sure rmm::rmm CMake target is visible to cudf users (#7524) @robertmaynard
- Fix contiguous_split not properly handling output partitions > 2 GB. (#7515) @nvdbaranec
- Change jit launch to safe_launch (#7510) @devavret
- Fix comparison between Datetime/Timedelta columns and NULL scalars (#7504) @brandon-b-miller
- Fix off-by-one error in char-parallel string scalar replace (#7502) @jlowe
- Fix JNI deprecation of all, put it on the wrong version before (#7501) @revans2
- Fix Series/Dataframe Mixed Arithmetic (#7491) @brandon-b-miller
- Fix JNI build after removal of libcudf sub-libraries (#7486) @jlowe
- Correctly compile benchmarks (#7485) @robertmaynard
- Fix bool column corruption with ORC Reader (#7483) @rgsl888prabhu
- Fix `__repr__` for categorical dtype (#7476) @galipremsagar
- Java cleaner synchronization (#7474) @abellina
- Fix java float/double parsing tests (#7473) @revans2
- Pass stream and user resource to make_default_constructed_scalar (#7469) @magnatelee
- Improve stability of dask_cudf.DataFrame.var and dask_cudf.DataFrame.std (#7453) @rjzamora
- Missing `device_storage_dispatch` change affecting `cudf::gather` (#7449) @codereport
- fix cuFile JNI compile errors (#7445) @rongou
- Support `Series.__setitem__` with key to a new row (#7443) @isVoid
- Fix BUG: Exception when PYTHONOPTIMIZE=2 (#7434) @skirui-source
- Make inclusive scan safe for cases with leading nulls (#7432) @magnatelee
- Fix typo in list_device_view::pair_rep_end() (#7423) @mythrocks
- Fix string to double conversion and row equivalent comparison (#7410) @ttnghia
- Fix thrust failure when transferring data from device_vector to host_vector with vectors of size 1 (#7382) @ttnghia
- Fix std::exception catch-by-reference gcc9 compile error (#7380) @davidwendt
- Fix skiprows issue with ORC Reader (#7359) @rgsl888prabhu
- fix Arrow CMake file (#7358) @rongou
- Fix lists::contains() for NaN and Decimals (#7349) @mythrocks
- Handle cupy array in `DataFrame.__setitem__` (#7340) @galipremsagar
- Fix invalid-device-fn error in cudf::strings::replace_re with multiple regex's (#7336) @davidwendt
- FIX Add codecov upload block to gpu script (#6860) @dillon-cullinan
📖 Documentation
- Fix join API doxygen (#7890) @shwina
- Add Resources to README. (#7697) @bdice
- Add `isin` examples in Docstring (#7479) @galipremsagar
- Resolving unlinked type shorthands in cudf doc (#7416) @isVoid
- Fix typo in regex.md doc page (#7363) @davidwendt
- Fix incorrect strings_column_view::chars_size documentation (#7360) @jlowe
🚀 New Features
- Enable basic reductions for decimal columns (#7776) @ChrisJar
- Enable join on decimal columns (#7764) @ChrisJar
- Allow merging index column with data column using keyword "on" (#7736) @skirui-source
- Implement DecimalColumn + Scalar and add cudf.Scalars of Decimal64Dtype (#7732) @brandon-b-miller
- Add support for `unique` groupby aggregation (#7726) @shwina
- Expose libcudf's label_bins function to cudf (#7724) @vyasr
- Adding support for equi-join on struct (#7720) @hyperbolic2346
- Add decimal column comparison operations (#7716) @isVoid
- Implement scan operations for decimal columns (#7707) @ChrisJar
- Enable typecasting between decimal and int (#7691) @ChrisJar
- Enable decimal support in parquet writer (#7673) @devavret
- Adds `list.unique` API (#7664) @isVoid
- Fix NaN handling in drop_list_duplicates (#7662) @ttnghia
- Add `lists.sort_values` API (#7657) @isVoid
- Add is_integer API that can check for the validity of a string-to-integer conversion (#7642) @ttnghia
- Adds `explode` API (#7607) @isVoid
- Adds `list.take`, python binding for `cudf::lists::segmented_gather` (#7591) @isVoid
- Implement cudf::label_bins() (#7554) @vyasr
- Add Python bindings for `lists::contains` (#7547) @skirui-source
- cudf::row_bit_count() support. (#7534) @nvdbaranec
- Implement drop_list_duplicates (#7528) @ttnghia
- Add Python bindings for `lists::extract_lists_element` (#7505) @skirui-source
- Add explode_outer and explode_outer_position (#7499) @hyperbolic2346
- Match Pandas logic for comparing two objects with nulls (#7490) @brandon-b-miller
- Add struct support to parquet writer (#7461) @devavret
- Enable type conversion from float to decimal type (#7450) @ChrisJar
- Add cython for converting strings/fixed-point functions (#7429) @davidwendt
- Add struct column support to cudf::sort and cudf::sorted_order (#7422) @karthikeyann
- Implement groupby collect_set (#7420) @ttnghia
- Merge branch-0.18 into branch-0.19 (#7411) @raydouglass
- Refactor strings column factories (#7397) @harrism
- Add groupby scan operations (sort groupby) (#7387) @karthikeyann
- Add cudf::explode_position (#7376) @hyperbolic2346
- Add string conversion to/from decimal values libcudf APIs (#7364) @davidwendt
- Add groupby SUM_OF_SQUARES support (#7362) @karthikeyann
- Add `Series.drop` api (#7304) @isVoid
- get_json_object() implementation (#7286) @nvdbaranec
- Python API for `ListMethods.len()` (#7283) @isVoid
- Support null_policy::EXCLUDE for COLLECT rolling aggregation (#7264) @mythrocks
- Add support for special tokens in nvtext::subword_tokenizer (#7254) @davidwendt
- Fix inplace update of data and add Series.update (#7201) @galipremsagar
- Implement `cudf::group_by` (hash) for `decimal32` and `decimal64` (#7190) @codereport
- Adding support to specify "level" parameter for `DataFrame.rename` (#7135) @skirui-source
🛠️ Improvements
- fix GDS include path for version 0.95 (#7877) @rongou
- Update `dask` + `distributed` to `2021.4.0` (#7858) @jakirkham
- Add ability to extract include dirs from `CUDF_HOME` (#7848) @galipremsagar
- Add USE_GDS as an option in build script (#7833) @pxLi
- add an allocate method with stream in java DeviceMemoryBuffer (#7826) @rongou
- Constrain dask and distributed versions to 2021.3.1 (#7825) @shwina
- Revert dask versioning of concat dispatch (#7823) @galipremsagar
- add copy methods in Java memory buffer (#7791) @rongou
- Update README and CONTRIBUTING for 0.19 (#7778) @robertmaynard
- Allow hash_partition to take a seed value (#7771) @magnatelee
- Turn on NVTX by default in java build (#7761) @tgravescs
- Add Java bindings to join gather map APIs (#7751) @jlowe
- Add replacements column support for Java replaceNulls (#7750) @jlowe
- Add Java bindings for row_bit_count (#7749) @jlowe
- Remove unused JVM array creation (#7748) @jlowe
- Added JNI support for new is_integer (#7739) @revans2
- Create and promote library aliases in libcudf installations (#7734) @trxcllnt
- Support groupby operations for decimal dtypes (#7731) @vyasr
- Memory map the input file only when GDS compatibility mode is not used (#7717) @vuule
- Replace device_vector with device_uvector in null_mask (#7715) @harrism
- Struct hashing support for SerialMurmur3 and SparkMurmur3 (#7714) @jlowe
- Add gbenchmark for nvtext replace-tokens function (#7708) @davidwendt
- Use stream in groupby calls (#7705) @karthikeyann
- Update codeowners file (#7701) @ajschmidt8
- Cleanup groupby to use host_span, device_span, device_uvector (#7698) @karthikeyann
- Add gbenchmark for nvtext ngrams functions (#7693) @davidwendt
- Misc Python/Cython optimizations (#7686) @shwina
- Add gbenchmark for nvtext tokenize functions (#7684) @davidwendt
- Add column_device_view to orc writer (#7676) @kaatish
- cudf_kafka now uses cuDF CMake export targets (CPM) (#7674) @robertmaynard
- Add gbenchmark for nvtext normalize functions (#7668) @davidwendt
- Resolve unnecessary import of thrust/optional.hpp in types.hpp (#7667) @vyasr
- Feature/optimize accessor copy (#7660) @vyasr
- Fix `find_package(cudf)` (#7658) @trxcllnt
- Work-around for gcc7 compile error on Centos7 (#7652) @davidwendt
- Add in JNI support for count_elements (#7651) @revans2
- Fix issues with building cudf in a non-conda environment (#7647) @galipremsagar
- Refactor ConfigureCUDA to not conditionally insert compiler flags (#7643) @robertmaynard
- Add gbenchmark for converting strings to/from timestamps (#7641) @davidwendt
- Handle constructing a `cudf.Scalar` from a `cudf.Scalar` (#7639) @shwina
- Add in JNI support for table partition (#7637) @revans2
- Add explicit fixed_point merge test (#7635) @codereport
- Add JNI support for IDENTITY hash partitioning (#7626) @revans2
- Java support on explode_outer (#7625) @sperlingxx
- Java support of casting string from/to decimal (#7623) @sperlingxx
- Convert cudf::concatenate APIs to use spans and device_uvector (#7621) @harrism
- Add gbenchmark for cudf::strings::translate function (#7617) @davidwendt
- Use file(COPY ) over file(INSTALL ) so cmake output is reduced (#7616) @robertmaynard
- Use rmm::device_uvector in place of rmm::device_vector for ORC reader/writer and cudf::io::column_buffer (#7614) @vuule
- Refactor Java host-side buffer concatenation to expose separate steps (#7610) @jlowe
- Add gbenchmarks for string substrings functions (#7603) @davidwendt
- Refactor string conversion check (#7599) @ttnghia
- JNI: Pass names of children struct columns to native Arrow IPC writer (#7598) @firestarman
- Revert "ENH Fix stale GHA and prevent duplicates " (#7595) @mike-wendt
- ENH Fix stale GHA and prevent duplicates (#7594) @mike-wendt
- Fix auto-detecting GPU architectures (#7593) @trxcllnt
- Reduce cudf library size (#7583) @robertmaynard
- Optimize cudf::make_strings_column for long strings (#7576) @davidwendt
- Always build and export the cudf::cudftestutil target (#7574) @trxcllnt
- Eliminate literal parameters to uvector::set_element_async and device_scalar::set_value (#7563) @harrism
- Add gbenchmark for strings::concatenate (#7560) @davidwendt
- Update Changelog Link (#7550) @ajschmidt8
- Add gbenchmarks for strings replace regex functions (#7541) @davidwendt
- Add `__repr__` for Column and ColumnAccessor (#7531) @shwina
- Support Decimal DIV changes in cudf (#7527) @razajafri
- Remove unneeded step parameter from strings::detail::copy_slice (#7525) @davidwendt
- Use device_uvector, device_span in sort groupby (#7523) @karthikeyann
- Add gbenchmarks for strings extract function (#7522) @davidwendt
- Rename ARROW_STATIC_LIB because it conflicts with one in FindArrow.cmake (#7518) @trxcllnt
- Reduce compile time/size for scan.cu (#7516) @davidwendt
- Change device_vector to device_uvector in nvtext source files (#7512) @davidwendt
- Removed unneeded includes from traits.hpp (#7509) @davidwendt
- FIX Remove random build directory generation for ccache (#7508) @dillon-cullinan
- xfail failing pytest in pandas 1.2.3 (#7507) @galipremsagar
- JNI bit cast (#7493) @revans2
- Combine rolling window function tests (#7480) @mythrocks
- Prepare Changelog for Automation (#7477) @ajschmidt8
- Java support for explode position (#7471) @sperlingxx
- Update 0.18 changelog entry (#7463) @ajschmidt8
- JNI: Support skipping nulls for collect aggregation (#7457) @firestarman
- Join APIs that return gathermaps (#7454) @shwina
- Remove dependence on managed memory for multimap test (#7451) @jrhemstad
- Use cuFile for Parquet IO when available (#7444) @vuule
- Statistics cleanup (#7439) @kaatish
- Add gbenchmarks for strings filter functions (#7438) @davidwendt
- `fixed_point` + `cudf::binary_operation` API Changes (#7435) @codereport
- Improve string gather performance (#7433) @jlowe
- Don't use user resource for a temporary allocation in sort_by_key (#7431) @magnatelee
- Detail APIs for datetime functions (#7430) @magnatelee
- Replace thrust::max_element with thrust::reduce in strings findall_re (#7428) @davidwendt
- Add gbenchmark for strings split/split_record functions (#7427) @davidwendt
- Update JNI build to use CMAKE_CUDA_ARCHITECTURES (#7425) @jlowe
- Change nvtext::load_vocabulary_file to return a unique ptr (#7424) @davidwendt
- Simplify type dispatch with `device_storage_dispatch` (#7419) @codereport
- Java support for casting of nested child columns (#7417) @razajafri
- Improve scalar string replace performance for long strings (#7415) @jlowe
- Remove unneeded temporary device vector for strings scatter specialization (#7409) @davidwendt
- bitmask_or implementation with bitmask refactor (#7406) @rwlee
- Add other cudf::strings::replace functions to current strings replace gbenchmark (#7403) @davidwendt
- Clean up included headers in `device_operators.cuh` (#7401) @codereport
- Move nullable index iterator to indexalator factory (#7399) @davidwendt
- ENH Pass ccache variables to conda recipe & use Ninja in CI (#7398) @Ethyling
- upgrade maven-antrun-plugin to support maven parallel builds (#7393) @rongou
- Add gbenchmark for strings find/contains functions (#7392) @davidwendt
- Use CMAKE_CUDA_ARCHITECTURES (#7391) @robertmaynard
- Refactor libcudf strings::replace to use make_strings_children utility (#7384) @davidwendt
- Added in JNI support for out of core sort algorithm (#7381) @revans2
- Upgrade pandas to 1.2 (#7375) @galipremsagar
- Rename `logical_cast` to `bit_cast` and allow additional conversions (#7373) @ttnghia
- jitify 2 support (#7372) @cwharris
- compile_udf: Cache PTX for similar functions (#7371) @gmarkall
- Add string scalar replace benchmark (#7369) @jlowe
- Add gbenchmark for strings contains_re/count_re functions (#7366) @davidwendt
- Update orc reader and writer fuzz tests (#7357) @galipremsagar
- Improve url_decode performance for long strings (#7353) @jlowe
- `cudf::ast` Small Refactorings (#7352) @codereport
- Remove std::cout and print in the scatter test function EmptyListsOfNullableStrings. (#7342) @ttnghia
- Use `cudf::detail::make_counting_transform_iterator` (#7338) @codereport
- Change block size parameter from a global to a template param. (#7333) @nvdbaranec
- Partial clean up of ORC writer (#7324) @vuule
- Add gbenchmark for cudf::strings::to_lower (#7316) @davidwendt
- Update Java bindings version to 0.19-SNAPSHOT (#7307) @pxLi
- Move `cudf::test::make_counting_transform_iterator` to `cudf/detail/iterator.cuh` (#7306) @codereport
- Use string literals in `fixed_point` `release_assert`s (#7303) @codereport
- Fix merge conflicts for #7295 (#7297) @ajschmidt8
- Add UTF-8 chars to create_random_column<string_view> benchmark utility (#7292) @davidwendt
- Abstracting block reduce and block scan from cuIO kernels with `cub` apis (#7278) @rgsl888prabhu
- Build.sh use cmake --build to drive build system invocation (#7270) @robertmaynard
- Refactor dictionary support for reductions any/all (#7242) @davidwendt
- Replace stream.value() with stream for stream_view args (#7236) @karthikeyann
- Interval index and interval_range (#7182) @marlenezw
- avro reader integration tests (#7156) @cwharris
- Rework libcudf CMakeLists.txt to export targets for CPM (#7107) @trxcllnt
- Adding Interval Dtype (#6984) @marlenezw
- Cleaning up `for` loops with `make_(counting_)transform_iterator` (#6546) @codereport
- C++
Published by GPUtester almost 5 years ago
https://github.com/rapidsai/cudf - [NIGHTLY] v0.18.0
🚨 Breaking Changes
- Default `groupby` to `sort=False` (#7180) @isVoid
- Add libcudf API for parsing of ORC statistics (#7136) @vuule
- Replace ORC writer api with class (#7099) @rgsl888prabhu
- Pack/unpack functionality to convert tables to and from a serialized format. (#7096) @nvdbaranec
- Replace parquet writer api with class (#7058) @rgsl888prabhu
- Add days check to cudf::is_timestamp using cuda::std::chrono classes (#7028) @davidwendt
- Fix default parameter values of `write_csv` and `write_parquet` (#6967) @vuule
- Align `Series.groupby` API to match Pandas (#6964) @kkraus14
- Share `factorize` implementation with Index and cudf module (#6885) @brandon-b-miller
🐛 Bug Fixes
- Fix null-bounds calculation for ranged window queries (#7568) @mythrocks
- Remove incorrect std::move call on return variable (#7319) @davidwendt
- Fix failing CI ORC test (#7313) @vuule
- Disallow constructing frames from a ColumnAccessor (#7298) @shwina
- fix java cuFile tests (#7296) @rongou
- Fix style issues related to NumPy (#7279) @shwina
- Fix bug when `iloc` slice terminates at before-the-zero position (#7277) @isVoid
- Fix copying dtype metadata after calling libcudf functions (#7271) @shwina
- Move lists utility function definition out of header (#7266) @mythrocks
- Throw if bool column would cause incorrect result when writing to ORC (#7261) @vuule
- Use `uvector` in `replace_nulls`; Fix `sort_helper::grouped_value` doc (#7256) @isVoid
- Remove floating point types from cudf::sort fast-path (#7250) @davidwendt
- Disallow picking output columns from nested columns. (#7248) @devavret
- Fix `loc` for Series with a MultiIndex (#7243) @shwina
- Fix Arrow column test leaks (#7241) @tgravescs
- Fix test column vector leak (#7238) @kuhushukla
- Fix some bugs in java scalar support for decimal (#7237) @revans2
- Improve `assert_eq` handling of scalar (#7220) @isVoid
- Fix missing null_count() comparison in test framework and related failures (#7219) @nvdbaranec
- Remove floating point types from radix sort fast-path (#7215) @davidwendt
- Fixing parquet benchmarks (#7214) @rgsl888prabhu
- Handle various parameter combinations in `replace` API (#7207) @galipremsagar
- Export mock aws credentials for s3 tests (#7176) @ayushdg
- Add `MultiIndex.rename` API (#7172) @isVoid
- Fix importing list & struct types in `from_arrow` (#7162) @galipremsagar
- Fixing parquet precision writing failing if scale is equal to precision (#7146) @hyperbolic2346
- Update s3 tests to use moto_server (#7144) @ayushdg
- Fix JIT cache multi-process test flakiness in slow drives (#7142) @devavret
- Fix compilation errors in libcudf (#7138) @galipremsagar
- Fix compilation failure caused by `-Wall` addition. (#7134) @codereport
- Add informative error message for `sep` in CSV writer (#7095) @galipremsagar
- Add JIT cache per compute capability (#7090) @devavret
- Implement `__hash__` method for ListDtype (#7081) @galipremsagar
- Only upload packages that were built (#7077) @raydouglass
- Fix comparisons between Series and cudf.NA (#7072) @brandon-b-miller
- Handle `nan` values correctly in `Series.one_hot_encoding` (#7059) @galipremsagar
- Add `unstack()` support for non-multiindexed dataframes (#7054) @isVoid
- Fix `read_orc` for decimal type (#7034) @rgsl888prabhu
- Fix backward compatibility of loading a 0.16 pkl file (#7033) @galipremsagar
- Decimal casts in JNI became a NOOP (#7032) @revans2
- Restore usual instance/subclass checking to cudf.DateOffset (#7029) @shwina
- Add days check to cudf::is_timestamp using cuda::std::chrono classes (#7028) @davidwendt
- Fix to_csv delimiter handling of timestamp format (#7023) @davidwendt
- Pin librdkafka to gcc 7 compatible version (#7021) @raydouglass
- Fix `fillna` & `dropna` to also consider `np.nan` as a missing value (#7019) @galipremsagar
- Fix round operator's HALF_EVEN computation for negative integers (#7014) @nartal1
- Skip Thrust sort patch if already applied (#7009) @harrism
- Fix `cudf::hash_partition` for `decimal32` and `decimal64` (#7006) @codereport
- Fix Thrust unroll patch command (#7002) @harrism
- Fix loc behaviour when key of incorrect type is used (#6993) @shwina
- Fix int to datetime conversion in csv_read (#6991) @kaatish
- fix excluding cufile tests by default (#6988) @rongou
- Fix java cufile tests when cufile is not installed (#6987) @revans2
- Make `cudf::round` for `fixed_point` when `scale = -decimal_places` a no-op (#6975) @codereport
- Fix type comparison for java (#6970) @revans2
- Fix default parameter values of `write_csv` and `write_parquet` (#6967) @vuule
- Align `Series.groupby` API to match Pandas (#6964) @kkraus14
- Fix timestamp parsing in ORC reader for timezones without transitions (#6959) @vuule
- Fix typo in numerical.py (#6957) @rgsl888prabhu
- `fixed_point_value` double-shifts in `fixed_point` construction (#6950) @codereport
- fix libcu++ include path for jni (#6948) @rongou
- Fix groupby agg/apply behaviour when no key columns are provided (#6945) @shwina
- Avoid inserting null elements into join hash table when nulls are treated as unequal (#6943) @hyperbolic2346
- Fix cudf::merge gtest for dictionary columns (#6942) @davidwendt
- Pass numeric scalars of the same dtype through numeric binops (#6938) @brandon-b-miller
- Fix N/A detection for empty fields in CSV reader (#6922) @vuule
- Fix rmm_mode=managed parameter for gtests (#6912) @davidwendt
- Fix nullmask offset handling in parquet and orc writer (#6889) @kaatish
- Correct the sampling range when sampling with replacement (#6884) @ChrisJar
- Handle nested string columns with no children in contiguous_split. (#6864) @nvdbaranec
- Fix `columns` & `index` handling in dataframe constructor (#6838) @galipremsagar
Documentation
- Update readme (#7318) @shwina
- Fix typo in cudf.core.column.string.extract docs (#7253) @adelevie
- Update doxyfile project number (#7161) @davidwendt
- Update 10 minutes to cuDF and CuPy with new APIs (#7158) @ChrisJar
- Cross link RMM & libcudf Doxygen docs (#7149) @ajschmidt8
- Add documentation for support dtypes in all IO formats (#7139) @galipremsagar
- Add groupby docs (#7100) @shwina
- Update cudf python docstrings with new null representation (`<NA>`) (#7050) @galipremsagar
- Make Doxygen comments formatting consistent (#7041) @vuule
- Add docs for working with missing data (#7010) @galipremsagar
- Remove warning in from_dlpack and to_dlpack methods (#7001) @miguelusque
- libcudf Developer Guide (#6977) @harrism
- Add JNI wrapper for the cuFile API (GDS) (#6940) @rongou
New Features
- Support `numeric_only` field for `rank()` (#7213) @isVoid
- Add support for `cudf::binary_operation` `TRUE_DIV` for `decimal32` and `decimal64` (#7198) @codereport
- Implement COLLECT rolling window aggregation (#7189) @mythrocks
- Add support for array-like inputs in `cudf.get_dummies` (#7181) @galipremsagar
- Default `groupby` to `sort=False` (#7180) @isVoid
- Add libcudf lists column count_elements API (#7173) @davidwendt
- Implement `cudf::group_by` (sort) for `decimal32` and `decimal64` (#7169) @codereport
- Add encoding and compression argument to CSV writer (#7168) @VibhuJawa
- `cudf::rolling_window` `SUM` support for `decimal32` and `decimal64` (#7147) @codereport
- Adding support for explode to cuDF (#7140) @hyperbolic2346
- Add libcudf API for parsing of ORC statistics (#7136) @vuule
- update GDS/cuFile location for 0.9 release (#7131) @rongou
- Add Segmented sort (#7122) @karthikeyann
- Add `cudf::binary_operation` `NULL_MIN`, `NULL_MAX` & `NULL_EQUALS` for `decimal32` and `decimal64` (#7119) @codereport
- Add `scale` and `value` methods to `fixed_point` (#7109) @codereport
- Replace ORC writer api with class (#7099) @rgsl888prabhu
- Pack/unpack functionality to convert tables to and from a serialized format. (#7096) @nvdbaranec
- Improve `digitize` API (#7071) @isVoid
- Add List types support in data generator (#7064) @galipremsagar
- `cudf::scan` support for `decimal32` and `decimal64` (#7063) @codereport
- `cudf::rolling` `ROW_NUMBER` support for `decimal32` and `decimal64` (#7061) @codereport
- Replace parquet writer api with class (#7058) @rgsl888prabhu
- Support contains() on lists of primitives (#7039) @mythrocks
- Implement `cudf::rolling` for `decimal32` and `decimal64` (#7037) @codereport
- Add `ffill` and `bfill` to string columns (#7036) @isVoid
- Enable round in cudf for DataFrame and Series (#7022) @ChrisJar
- Extend `replace_nulls_policy` to `string` and `dictionary` type (#7004) @isVoid
- Add segmented_gather(list_column, gather_list) (#7003) @karthikeyann
- Add `method` field to `fillna` for fixed width columns (#6998) @isVoid
- Manual merge of branch 0.17 into branch 0.18 (#6995) @shwina
- Implement `cudf::reduce` for `decimal32` and `decimal64` (part 2) (#6980) @codereport
- Add Ufunc alias look up for appropriate numpy ufunc dispatching (#6973) @VibhuJawa
- Add pytest-xdist to dev environment.yml (#6958) @galipremsagar
- Add `Index.set_names` api (#6929) @galipremsagar
- Add `replace_null` API with `replace_policy` parameter, `fixed_width` column support (#6907) @isVoid
- Share `factorize` implementation with Index and cudf module (#6885) @brandon-b-miller
- Implement update() function (#6883) @skirui-source
- Add groupby idxmin, idxmax aggregation (#6856) @karthikeyann
- Implement `cudf::reduce` for `decimal32` and `decimal64` (part 1) (#6814) @codereport
- Implement cudf.DateOffset for months (#6775) @brandon-b-miller
- Add Python DecimalColumn (#6715) @shwina
- Add dictionary support to libcudf groupby functions (#6585) @davidwendt
Improvements
- Update stale GHA with exemptions & new labels (#7395) @mike-wendt
- Add GHA to mark issues/prs as stale/rotten (#7388) @Ethyling
- Unpin from numpy < 1.20 (#7335) @shwina
- Prepare Changelog for Automation (#7309) @galipremsagar
- Prepare Changelog for Automation (#7272) @ajschmidt8
- Add JNI support for converting Arrow buffers to CUDF ColumnVectors (#7222) @tgravescs
- Add coverage for `skiprows` and `num_rows` in parquet reader fuzz testing (#7216) @galipremsagar
- Define and implement more behavior for merging on categorical variables (#7209) @brandon-b-miller
- Add CudfSeriesGroupBy to optimize dask_cudf groupby-mean (#7194) @rjzamora
- Add dictionary column support to rolling_window (#7186) @davidwendt
- Modify the semantics of `end` pointers in cuIO to match standard library (#7179) @vuule
- Adding unit tests for `fixed_point` with extremely large `scale`s (#7178) @codereport
- Fast path single column sort (#7167) @davidwendt
- Fix -Werror=sign-compare errors in device code (#7164) @trxcllnt
- Refactor cudf::string_view host and device code (#7159) @davidwendt
- Enable logic for GPU auto-detection in cudfjni (#7155) @gerashegalov
- Java bindings for Fixed-point type support for Parquet (#7153) @razajafri
- Add Java interface for the new API 'explode' (#7151) @firestarman
- Replace offsets with iterators in cuIO utilities and CSV parser (#7150) @vuule
- Add gbenchmarks for reduction aggregations any() and all() (#7129) @davidwendt
- Update JNI for contiguous_split packed results (#7127) @jlowe
- Add JNI and Java bindings for list_contains (#7125) @kuhushukla
- Add Java unit tests for window aggregate 'collect' (#7121) @firestarman
- verify window operations on decimal with java tests (#7120) @sperlingxx
- Adds in JNI support for creating a list column from existing columns (#7112) @revans2
- Build libcudf with -Wall (#7105) @trxcllnt
- Add column_device_view pointers to EncColumnDesc (#7097) @kaatish
- Add `pyorc` to dev environment (#7085) @galipremsagar
- JNI support for creating struct column from existing columns and fixed bug in struct with no children (#7084) @revans2
- Fastpath single strings column in cudf::sort (#7075) @davidwendt
- Upgrade nvcomp to 1.2.1 (#7069) @rongou
- Refactor ORC `ProtobufReader` to make it more extendable (#7055) @vuule
- Add Java tests for decimal casts (#7051) @sperlingxx
- Auto-label PRs based on their content (#7044) @jolorunyomi
- Create sort gbenchmark for strings column (#7040) @davidwendt
- Refactor io memory fetches to use hostdevice_vector methods (#7035) @ChrisJar
- Spark Murmur3 hash functionality (#7024) @rwlee
- Fix libcudf strings logic where size_type is used to access INT32 column data (#7020) @davidwendt
- Adding decimal writing support to parquet (#7017) @hyperbolic2346
- Add compression="infer" as default for dask_cudf.read_csv (#7013) @rjzamora
- Correct ORC docstring; other minor cuIO improvements (#7012) @vuule
- Reduce number of hostdevice_vector allocations in parquet reader (#7005) @devavret
- Check output size overflow on strings gather (#6997) @davidwendt
- Improve representation of `MultiIndex` (#6992) @galipremsagar
- Disable some pragma unroll statements in thrust sort.h (#6982) @davidwendt
- Minor `cudf::round` internal refactoring (#6976) @codereport
- Add Java bindings for URL conversion (#6972) @jlowe
- Enable strict_decimal_types in parquet reading (#6969) @sperlingxx
- Add in basic support to JNI for logical_cast (#6954) @revans2
- Remove duplicate file array_tests.cpp (#6953) @karthikeyann
- Add null mask `fixed_point_column_wrapper` constructors (#6951) @codereport
- Update Java bindings version to 0.18-SNAPSHOT (#6949) @jlowe
- Use simplified `rmm::exec_policy` (#6939) @harrism
- Add null count test for apply_boolean_mask (#6903) @harrism
- Implement DataFrame.quantile for datetime and timedelta data types (#6902) @ChrisJar
- Remove **kwargs from string/categorical methods (#6750) @shwina
- Refactor rolling.cu to reduce compile time (#6512) @mythrocks
- Add static type checking via Mypy (#6381) @shwina
- Update to official libcu++ on Github (#6275) @trxcllnt
- C++
Published by rapids-bot[bot] almost 5 years ago
https://github.com/rapidsai/cudf - v0.18.0
Breaking Changes
- Default `groupby` to `sort=False` (#7180) @isVoid
- Add libcudf API for parsing of ORC statistics (#7136) @vuule
- Replace ORC writer api with class (#7099) @rgsl888prabhu
- Pack/unpack functionality to convert tables to and from a serialized format. (#7096) @nvdbaranec
- Replace parquet writer api with class (#7058) @rgsl888prabhu
- Add days check to cudf::is_timestamp using cuda::std::chrono classes (#7028) @davidwendt
- Fix default parameter values of `write_csv` and `write_parquet` (#6967) @vuule
- Align `Series.groupby` API to match Pandas (#6964) @kkraus14
- Share `factorize` implementation with Index and cudf module (#6885) @brandon-b-miller
Bug Fixes
- Remove incorrect std::move call on return variable (#7319) @davidwendt
- Fix failing CI ORC test (#7313) @vuule
- Disallow constructing frames from a ColumnAccessor (#7298) @shwina
- fix java cuFile tests (#7296) @rongou
- Fix style issues related to NumPy (#7279) @shwina
- Fix bug when `iloc` slice terminates at before-the-zero position (#7277) @isVoid
- Fix copying dtype metadata after calling libcudf functions (#7271) @shwina
- Move lists utility function definition out of header (#7266) @mythrocks
- Throw if bool column would cause incorrect result when writing to ORC (#7261) @vuule
- Use `uvector` in `replace_nulls`; Fix `sort_helper::grouped_value` doc (#7256) @isVoid
- Remove floating point types from cudf::sort fast-path (#7250) @davidwendt
- Disallow picking output columns from nested columns. (#7248) @devavret
- Fix `loc` for Series with a MultiIndex (#7243) @shwina
- Fix Arrow column test leaks (#7241) @tgravescs
- Fix test column vector leak (#7238) @kuhushukla
- Fix some bugs in java scalar support for decimal (#7237) @revans2
- Improve `assert_eq` handling of scalar (#7220) @isVoid
- Fix missing null_count() comparison in test framework and related failures (#7219) @nvdbaranec
- Remove floating point types from radix sort fast-path (#7215) @davidwendt
- Fixing parquet benchmarks (#7214) @rgsl888prabhu
- Handle various parameter combinations in `replace` API (#7207) @galipremsagar
- Export mock aws credentials for s3 tests (#7176) @ayushdg
- Add `MultiIndex.rename` API (#7172) @isVoid
- Fix importing list & struct types in `from_arrow` (#7162) @galipremsagar
- Fixing parquet precision writing failing if scale is equal to precision (#7146) @hyperbolic2346
- Update s3 tests to use moto_server (#7144) @ayushdg
- Fix JIT cache multi-process test flakiness in slow drives (#7142) @devavret
- Fix compilation errors in libcudf (#7138) @galipremsagar
- Fix compilation failure caused by `-Wall` addition. (#7134) @codereport
- Add informative error message for `sep` in CSV writer (#7095) @galipremsagar
- Add JIT cache per compute capability (#7090) @devavret
- Implement `__hash__` method for ListDtype (#7081) @galipremsagar
- Only upload packages that were built (#7077) @raydouglass
- Fix comparisons between Series and cudf.NA (#7072) @brandon-b-miller
- Handle `nan` values correctly in `Series.one_hot_encoding` (#7059) @galipremsagar
- Add `unstack()` support for non-multiindexed dataframes (#7054) @isVoid
- Fix `read_orc` for decimal type (#7034) @rgsl888prabhu
- Fix backward compatibility of loading a 0.16 pkl file (#7033) @galipremsagar
- Decimal casts in JNI became a NOOP (#7032) @revans2
- Restore usual instance/subclass checking to cudf.DateOffset (#7029) @shwina
- Add days check to cudf::is_timestamp using cuda::std::chrono classes (#7028) @davidwendt
- Fix to_csv delimiter handling of timestamp format (#7023) @davidwendt
- Pin librdkafka to gcc 7 compatible version (#7021) @raydouglass
- Fix `fillna` & `dropna` to also consider `np.nan` as a missing value (#7019) @galipremsagar
- Fix round operator's HALF_EVEN computation for negative integers (#7014) @nartal1
- Skip Thrust sort patch if already applied (#7009) @harrism
- Fix `cudf::hash_partition` for `decimal32` and `decimal64` (#7006) @codereport
- Fix Thrust unroll patch command (#7002) @harrism
- Fix loc behaviour when key of incorrect type is used (#6993) @shwina
- Fix int to datetime conversion in csv_read (#6991) @kaatish
- fix excluding cufile tests by default (#6988) @rongou
- Fix java cufile tests when cufile is not installed (#6987) @revans2
- Make `cudf::round` for `fixed_point` when `scale = -decimal_places` a no-op (#6975) @codereport
- Fix type comparison for java (#6970) @revans2
- Fix default parameter values of `write_csv` and `write_parquet` (#6967) @vuule
- Align `Series.groupby` API to match Pandas (#6964) @kkraus14
- Fix timestamp parsing in ORC reader for timezones without transitions (#6959) @vuule
- Fix typo in numerical.py (#6957) @rgsl888prabhu
- `fixed_point_value` double-shifts in `fixed_point` construction (#6950) @codereport
- fix libcu++ include path for jni (#6948) @rongou
- Fix groupby agg/apply behaviour when no key columns are provided (#6945) @shwina
- Avoid inserting null elements into join hash table when nulls are treated as unequal (#6943) @hyperbolic2346
- Fix cudf::merge gtest for dictionary columns (#6942) @davidwendt
- Pass numeric scalars of the same dtype through numeric binops (#6938) @brandon-b-miller
- Fix N/A detection for empty fields in CSV reader (#6922) @vuule
- Fix rmm_mode=managed parameter for gtests (#6912) @davidwendt
- Fix nullmask offset handling in parquet and orc writer (#6889) @kaatish
- Correct the sampling range when sampling with replacement (#6884) @ChrisJar
- Handle nested string columns with no children in contiguous_split. (#6864) @nvdbaranec
- Fix `columns` & `index` handling in dataframe constructor (#6838) @galipremsagar
Documentation
- Update readme (#7318) @shwina
- Fix typo in cudf.core.column.string.extract docs (#7253) @adelevie
- Update doxyfile project number (#7161) @davidwendt
- Update 10 minutes to cuDF and CuPy with new APIs (#7158) @ChrisJar
- Cross link RMM & libcudf Doxygen docs (#7149) @ajschmidt8
- Add documentation for support dtypes in all IO formats (#7139) @galipremsagar
- Add groupby docs (#7100) @shwina
- Update cudf python docstrings with new null representation (`<NA>`) (#7050) @galipremsagar
- Make Doxygen comments formatting consistent (#7041) @vuule
- Add docs for working with missing data (#7010) @galipremsagar
- Remove warning in from_dlpack and to_dlpack methods (#7001) @miguelusque
- libcudf Developer Guide (#6977) @harrism
- Add JNI wrapper for the cuFile API (GDS) (#6940) @rongou
New Features
- Support `numeric_only` field for `rank()` (#7213) @isVoid
- Add support for `cudf::binary_operation` `TRUE_DIV` for `decimal32` and `decimal64` (#7198) @codereport
- Implement COLLECT rolling window aggregation (#7189) @mythrocks
- Add support for array-like inputs in `cudf.get_dummies` (#7181) @galipremsagar
- Default `groupby` to `sort=False` (#7180) @isVoid
- Add libcudf lists column count_elements API (#7173) @davidwendt
- Implement `cudf::group_by` (sort) for `decimal32` and `decimal64` (#7169) @codereport
- Add encoding and compression argument to CSV writer (#7168) @VibhuJawa
- `cudf::rolling_window` `SUM` support for `decimal32` and `decimal64` (#7147) @codereport
- Adding support for explode to cuDF (#7140) @hyperbolic2346
- Add libcudf API for parsing of ORC statistics (#7136) @vuule
- update GDS/cuFile location for 0.9 release (#7131) @rongou
- Add Segmented sort (#7122) @karthikeyann
- Add `cudf::binary_operation` `NULL_MIN`, `NULL_MAX` & `NULL_EQUALS` for `decimal32` and `decimal64` (#7119) @codereport
- Add `scale` and `value` methods to `fixed_point` (#7109) @codereport
- Replace ORC writer api with class (#7099) @rgsl888prabhu
- Pack/unpack functionality to convert tables to and from a serialized format. (#7096) @nvdbaranec
- Improve `digitize` API (#7071) @isVoid
- Add List types support in data generator (#7064) @galipremsagar
- `cudf::scan` support for `decimal32` and `decimal64` (#7063) @codereport
- `cudf::rolling` `ROW_NUMBER` support for `decimal32` and `decimal64` (#7061) @codereport
- Replace parquet writer api with class (#7058) @rgsl888prabhu
- Support contains() on lists of primitives (#7039) @mythrocks
- Implement `cudf::rolling` for `decimal32` and `decimal64` (#7037) @codereport
- Add `ffill` and `bfill` to string columns (#7036) @isVoid
- Enable round in cudf for DataFrame and Series (#7022) @ChrisJar
- Extend `replace_nulls_policy` to `string` and `dictionary` type (#7004) @isVoid
- Add segmented_gather(list_column, gather_list) (#7003) @karthikeyann
- Add `method` field to `fillna` for fixed width columns (#6998) @isVoid
- Manual merge of branch 0.17 into branch 0.18 (#6995) @shwina
- Implement `cudf::reduce` for `decimal32` and `decimal64` (part 2) (#6980) @codereport
- Add Ufunc alias look up for appropriate numpy ufunc dispatching (#6973) @VibhuJawa
- Add pytest-xdist to dev environment.yml (#6958) @galipremsagar
- Add `Index.set_names` api (#6929) @galipremsagar
- Add `replace_null` API with `replace_policy` parameter, `fixed_width` column support (#6907) @isVoid
- Share `factorize` implementation with Index and cudf module (#6885) @brandon-b-miller
- Implement update() function (#6883) @skirui-source
- Add groupby idxmin, idxmax aggregation (#6856) @karthikeyann
- Implement `cudf::reduce` for `decimal32` and `decimal64` (part 1) (#6814) @codereport
- Implement cudf.DateOffset for months (#6775) @brandon-b-miller
- Add Python DecimalColumn (#6715) @shwina
- Add dictionary support to libcudf groupby functions (#6585) @davidwendt
Improvements
- Update stale GHA with exemptions & new labels (#7395) @mike-wendt
- Add GHA to mark issues/prs as stale/rotten (#7388) @Ethyling
- Unpin from numpy < 1.20 (#7335) @shwina
- Prepare Changelog for Automation (#7309) @galipremsagar
- Prepare Changelog for Automation (#7272) @ajschmidt8
- Add JNI support for converting Arrow buffers to CUDF ColumnVectors (#7222) @tgravescs
- Add coverage for `skiprows` and `num_rows` in parquet reader fuzz testing (#7216) @galipremsagar
- Define and implement more behavior for merging on categorical variables (#7209) @brandon-b-miller
- Add CudfSeriesGroupBy to optimize dask_cudf groupby-mean (#7194) @rjzamora
- Add dictionary column support to rolling_window (#7186) @davidwendt
- Modify the semantics of `end` pointers in cuIO to match standard library (#7179) @vuule
- Adding unit tests for `fixed_point` with extremely large `scale`s (#7178) @codereport
- Fast path single column sort (#7167) @davidwendt
- Fix -Werror=sign-compare errors in device code (#7164) @trxcllnt
- Refactor cudf::string_view host and device code (#7159) @davidwendt
- Enable logic for GPU auto-detection in cudfjni (#7155) @gerashegalov
- Java bindings for Fixed-point type support for Parquet (#7153) @razajafri
- Add Java interface for the new API 'explode' (#7151) @firestarman
- Replace offsets with iterators in cuIO utilities and CSV parser (#7150) @vuule
- Add gbenchmarks for reduction aggregations any() and all() (#7129) @davidwendt
- Update JNI for contiguous_split packed results (#7127) @jlowe
- Add JNI and Java bindings for list_contains (#7125) @kuhushukla
- Add Java unit tests for window aggregate 'collect' (#7121) @firestarman
- verify window operations on decimal with java tests (#7120) @sperlingxx
- Adds in JNI support for creating a list column from existing columns (#7112) @revans2
- Build libcudf with -Wall (#7105) @trxcllnt
- Add column_device_view pointers to EncColumnDesc (#7097) @kaatish
- Add `pyorc` to dev environment (#7085) @galipremsagar
- JNI support for creating struct column from existing columns and fixed bug in struct with no children (#7084) @revans2
- Fastpath single strings column in cudf::sort (#7075) @davidwendt
- Upgrade nvcomp to 1.2.1 (#7069) @rongou
- Refactor ORC `ProtobufReader` to make it more extendable (#7055) @vuule
- Add Java tests for decimal casts (#7051) @sperlingxx
- Auto-label PRs based on their content (#7044) @jolorunyomi
- Create sort gbenchmark for strings column (#7040) @davidwendt
- Refactor io memory fetches to use hostdevice_vector methods (#7035) @ChrisJar
- Spark Murmur3 hash functionality (#7024) @rwlee
- Fix libcudf strings logic where size_type is used to access INT32 column data (#7020) @davidwendt
- Adding decimal writing support to parquet (#7017) @hyperbolic2346
- Add compression="infer" as default for dask_cudf.read_csv (#7013) @rjzamora
- Correct ORC docstring; other minor cuIO improvements (#7012) @vuule
- Reduce number of hostdevice_vector allocations in parquet reader (#7005) @devavret
- Check output size overflow on strings gather (#6997) @davidwendt
- Improve representation of `MultiIndex` (#6992) @galipremsagar
- Disable some pragma unroll statements in thrust sort.h (#6982) @davidwendt
- Minor `cudf::round` internal refactoring (#6976) @codereport
- Add Java bindings for URL conversion (#6972) @jlowe
- Enable strict_decimal_types in parquet reading (#6969) @sperlingxx
- Add in basic support to JNI for logical_cast (#6954) @revans2
- Remove duplicate file array_tests.cpp (#6953) @karthikeyann
- Add null mask `fixed_point_column_wrapper` constructors (#6951) @codereport
- Update Java bindings version to 0.18-SNAPSHOT (#6949) @jlowe
- Use simplified `rmm::exec_policy` (#6939) @harrism
- Add null count test for apply_boolean_mask (#6903) @harrism
- Implement DataFrame.quantile for datetime and timedelta data types (#6902) @ChrisJar
- Remove **kwargs from string/categorical methods (#6750) @shwina
- Refactor rolling.cu to reduce compile time (#6512) @mythrocks
- Add static type checking via Mypy (#6381) @shwina
- Update to official libcu++ on Github (#6275) @trxcllnt
- C++
Published by GPUtester about 5 years ago
https://github.com/rapidsai/cudf - v0.17.0
v0.17.0 Release
- C++
Published by GPUtester about 5 years ago
https://github.com/rapidsai/cudf - v0.16.0
v0.16.0 Release
- C++
Published by GPUtester over 5 years ago
https://github.com/rapidsai/cudf - v0.15.0
v0.15.0 Release
- C++
Published by raydouglass over 5 years ago