Recent Releases of https://github.com/rapidsai/cudf

https://github.com/rapidsai/cudf - v25.08.00

🚨 Breaking Changes

  • Allow np.dtype('object') for cases that are valid (#19478) @galipremsagar
  • [FEA] Remove CUDA JIT-Compatibility Checks & CCCL WARs (#19470) @lamarrr
  • Drop cuda 11 usages (#19386) @galipremsagar
  • Deprecate cudf::round for float types (#19298) @davidwendt
  • Support output_dtype in cudf::reduce for nunique aggregation (#19265) @davidwendt
  • Change default cudf-polars executor to "streaming" (#19263) @TomAugspurger
  • Fix Handling of Complex Types in AST (#19248) @lamarrr
  • Enable chunked reading of PQ sources with >2B rows (#19245) @mhaseeb123
  • Refactor grid_1d class (#19211) @lamarrr
  • Return valid for all-nulls in reduce() with nunique include-nulls aggregation (#19196) @davidwendt
  • Refactor JNI error handling (#19149) @ttnghia
  • Remove CUDA 11 from dependencies.yaml (#19139) @KyleFromNVIDIA
  • Quick fixes of modernize-use-constraints rule (#19105) @vuule
  • Filter Parquet row groups using row bounds (#19082) @mhaseeb123
  • Temporarily revert "Refactor JNI error handling (#18983)" (#19076) @abellina
  • Rename parquet_chunked_writer to chunked_parquet_writer for consistency with the reader (#19047) @mhaseeb123
  • Compile libcudf using C++20 Standard (#19045) @vuule
  • Refactor JNI error handling (#18983) @ttnghia
  • stop uploading packages to downloads.rapids.ai (#18973) @jameslamb
  • Remove deprecated Series methods, isclose (#18947) @mroeschke
  • Remove deprecated groupby.collect (#18946) @mroeschke
  • Remove deprecated get_dummies(cats=, ...) (#18944) @mroeschke
  • Add pylibcudf.Column.from_arrow factory method (#18937) @Matt711
  • Add pylibcudf.Table.from_arrow factory method (#18936) @Matt711
  • Remove deprecated APIs (#18933) @vuule
  • Remove cudf.Scalar (#18927) @mroeschke
  • Remove deprecated cudf::io::host_buffer (#18881) @Matt711
  • Null-handling for Transforms (#18845) @lamarrr
  • Enable skip_rows in the chunked parquet reader. (#18130) @mhaseeb123

🐛 Bug Fixes

  • Increase alignment requirement for parquet bloom filter to 256 (#19595) @mhaseeb123
  • Revert "Add primitive row dispatch support for semi/anti join and cudf::contains" (#19503) @PointKernel
  • Allow np.dtype('object') for cases that are valid (#19478) @galipremsagar
  • Add conda dependency on nvidia-ml-py. (#19454) @bdice
  • Mark cudf.pandas notebook repr test as flaky (#19441) @Matt711
  • Fix pytest to properly expose a bug (#19433) @galipremsagar
  • Switch from thrust::sort to cub::DeviceRadixSort in Parquet chunked reader (#19414) @ttnghia
  • Use numba-cuda>=0.15.2,<0.16 (#19413) @bdice
  • Update String Transform Examples (#19407) @lamarrr
  • [BUG] Make floor division and modulo by 0 match CPU polars (#19406) @Matt711
  • Handle empty input in cudf::strings::extract APIs (#19398) @davidwendt
  • Fix jitify error on exit from FILTER_TEST (#19395) @davidwendt
  • Update cudf.pandas tests to silence deprecation warnings (#19377) @Matt711
  • Replace sprintf with snprintf in libcudf parquet tests (#19371) @davidwendt
  • Make DateOffset respect timezone (#19366) @Matt711
  • Fix flaky tests in cudf.pandas (#19345) @TomAugspurger
  • Update protocol choices for ucxx in PDSH benchmark (#19343) @TomAugspurger
  • Remove passing pandas tests from xfail list (#19341) @Matt711
  • Fix Union-Slice bug (#19336) @Matt711
  • Fix bit shift overflow in segmented_offset_bitmask_binop utility (#19329) @davidwendt
  • Fix job filters for pandas-tests (#19322) @galipremsagar
  • Fix compile warning in interop_stringview.cpp (#19320) @davidwendt
  • Fix a use-after-free issue in TDigest aggregation code. (#19311) @nvdbaranec
  • Always represent datetime aware data as UTC in strftime (#19304) @mroeschke
  • Do not pass cupy objects to numba kernels directly (#19283) @brandon-b-miller
  • Correct docstring for DataFrame.apply to match code (#19262) @dagardner-nv
  • Cast n_unique aggregation result to match polars (#19256) @Matt711
  • Fix Handling of Complex Types in AST (#19248) @lamarrr
  • Add missing include (#19239) @vyasr
  • Raised MixedTypeErrors for condition that lead to mixed types (#19232) @galipremsagar
  • Fix errors in the nvCOMP adapter (#19221) @vuule
  • Remove nvToolsExt usage (#19209) @vyasr
  • Fix a pair of bugs in get_decompression_scratch() size. (#19207) @nvdbaranec
  • Allow is_list_like to return correct values by disabling it (#19188) @galipremsagar
  • Fix slicing after Join and GroupBy in streaming cudf-polars (#19187) @rjzamora
  • Fix binops type preservation for some dtypes (#19183) @galipremsagar
  • Fix streaming GroupBy on non-trivial keys (#19181) @rjzamora
  • Fix bitmask in from_arrow_host for sliced stringview type (#19174) @davidwendt
  • Fixed group_by mean with missing values and multiple partitions (#19165) @TomAugspurger
  • Add fallback to HStack lowering in cudf-polars (#19163) @rjzamora
  • Fix Literal partitioning in cudf-polars (#19160) @rjzamora
  • Fix from_array_interface for empty arrays (#19144) @Matt711
  • Adding GH_TOKEN pass-through to summarize job (#19143) @msarahan
  • Fix hash collision in Union([MapFunction]) (#19124) @TomAugspurger
  • Fix bug in group_by().n_unique() in streaming cudf-polars (#19108) @rjzamora
  • Parse (non-MultiIndex) label-based keys to structured data (#19103) @mroeschke
  • Fix cudf_polars spilling (#19101) @TomAugspurger
  • Fix libcudf strings case logic to set null-row size to zero (#19095) @davidwendt
  • Temporarily revert "Refactor JNI error handling (#18983)" (#19076) @abellina
  • Temporary workaround for incorrect SplitScan results in cuDF-Polars (#19071) @rjzamora
  • Use default memory resource for JSON_QUOTE_NORMALIZATION gtests (#19057) @davidwendt
  • Added null-probability to polynomial benchmarks and fixed transform call-sites (#18972) @lamarrr
  • Fix flaky custreamz test (#18961) @TomAugspurger
  • Fix tdigest percentile correctness for low row-counts (#18952) @mythrocks
  • Enable skip_rows in the chunked parquet reader. (#18130) @mhaseeb123

📖 Documentation

  • Update conda environment file for CUDA 12.9 compatibility (#19376) @a-hirota
  • Update recommended gcc version in contributing guide (#19365) @davidwendt
  • Autodoc DateOffset (#19297) @wence-
  • Fix cudf::column_device_view::element() doxygen (#19296) @davidwendt
  • Document aggregations for cudf::reduce in doxygen (#19264) @davidwendt
  • add docs on CI workflow inputs (#19234) @jameslamb
  • Update README and CONTRIBUTING to reflect new CUDA requirements (#19138) @PointKernel
  • Remove the extra index URL for CUDA 12 (#19128) @vyasr
  • Improve WordPieceVocabulary.tokenize documentation (#19098) @davidwendt
  • Add some basic streaming engine documentation (#19088) @wence-
  • Update the contributing guide to include pylibcudf in the build command (#19011) @Matt711
  • Fix pylibcudf docs for some strings APIs (#19004) @davidwendt
  • Update cuDF Python library design with BaseIndex and pylibcudf updates (#18903) @mroeschke

🚀 New Features

  • Avoid using UVM on systems without a traditional memory resource (#19444) @Matt711
  • Add parquet-sampling configuration options (#19423) @rjzamora
  • Add new JSON reader interface accepting string column input to pylibcudf (#19400) @shrshi
  • Add a parquet reader utility to update output null masks (#19370) @mhaseeb123
  • Build and ship shim.cu file as LTOIR (#19368) @brandon-b-miller
  • Add cudf::strings::find_instance API (#19326) @davidwendt
  • Add single-file streaming Sink support (#19317) @rjzamora
  • Support null_count expression (#19314) @Matt711
  • Materialize tables in the experimental Parquet reader (#19308) @mhaseeb123
  • Add new cudf::top_k API (#19303) @davidwendt
  • Add cudf::strings::split_part API (#19289) @davidwendt
  • Support output_dtype in cudf::reduce for nunique aggregation (#19265) @davidwendt
  • Add post_traversal API to cudf-polars (#19258) @rjzamora
  • Deprecate DataFrame.apply_rows (#19218) @brandon-b-miller
  • Require numba-cuda>=0.16.0 (#19213) @brandon-b-miller
  • Add a mode to co-process decompression and compression on host and device (#19203) @vuule
  • Return valid for all-nulls in reduce() with nunique include-nulls aggregation (#19196) @davidwendt
  • Refactor JNI error handling (#19149) @ttnghia
  • Add support for horizontal string concatenation pl.concat_str (#19142) @Matt711
  • Add PDS-DS Query 1 (#19131) @Matt711
  • Support cudf-polars str.reverse (#19117) @brandon-b-miller
  • Support cudf-polars str.pad_end and str.pad_start (#19116) @brandon-b-miller
  • Support cudf-polars str.head and str.tail (#19115) @brandon-b-miller
  • Support cudf-polars str.to_titlecase (#19114) @brandon-b-miller
  • Add cudf/io/codec.hpp to expose compression/decompression APIs (#19113) @ttnghia
  • Support converting decimals to/from pylibcudf scalars (#19106) @Matt711
  • Support resource-constrained sort-merge inner join operation through left table partitioning (#19102) @shrshi
  • Filter Parquet row groups using row bounds (#19082) @mhaseeb123
  • Implement UDF Filters (#19070) @lamarrr
  • Move the remaining libcudf pieces to C++20 (#19065) @vuule
  • Allow using a stream per thread at runtime (#19051) @vyasr
  • Remove stacktrace retrieval code (#19048) @ttnghia
  • Compile libcudf using C++20 Standard (#19045) @vuule
  • String Transform Examples: Added Branching, Public API Versions, and Sampling (#19038) @lamarrr
  • Refactor JNI error handling (#18983) @ttnghia
  • Add basic Sink support for streaming cudf-polars executor (#18963) @rjzamora
  • Fix debug-build Failure in JIT Tests (#18939) @lamarrr
  • Add from_arrow factory methods for Scalar and DataType (#18938) @Matt711
  • Add pylibcudf.Column.from_arrow factory method (#18937) @Matt711
  • Add pylibcudf.Table.from_arrow factory method (#18936) @Matt711
  • Update nvCOMP adapter (#18931) @vuule
  • Create a pylibcudf Column from a iterable of python strings (#18916) @Matt711
  • Add CLI argument to enable OOM protection in PDS-H (#18914) @pentschev
  • Implement data page pruning using Parquet page index stats (#18873) @mhaseeb123
  • Null-handling for Transforms (#18845) @lamarrr
  • Implement row group pruning with dictionaries in experimental PQ reader (#18836) @mhaseeb123
  • Add support for parquet scan + count operation (#18463) @Matt711
  • Manage strings with NRT (#18453) @brandon-b-miller

🛠️ Improvements

  • Disable codecov comments (#19472) @bdice
  • [FEA] Remove CUDA JIT-Compatibility Checks & CCCL WARs (#19470) @lamarrr
  • Use libnvcomp conda package (#19439) @bdice
  • JNI Set RMM_LOG_LEVEL and RMM_LOG_ACTIVE_LEVEL to allow setting log level at compile time (#19435) @abellina
  • Use numba-cuda >=0.14.0,<0.15.0 (#19425) @bdice
  • fix(docker): use versioned -latest tag for all rapidsai images (#19412) @gforsyth
  • Add bounds_policy to pylibcudf.lists.segmented_gather (#19411) @TomAugspurger
  • Require nvidia-ml-py in cudf-polars and adjust default default_blocksize (#19410) @rjzamora
  • More pytest fixtures and avoid GPU params in cuDF classic tests (#19404) @mroeschke
  • More pytest fixtures and avoid GPU params in cuDF classic tests (#19402) @mroeschke
  • Use more pytest fixtures and avoid GPU parameterization in cuDF classic tests (#19401) @mroeschke
  • Support range syntax and improve validation message when running PDS-H/PDS-DS (#19399) @Matt711
  • Drop cuda 11 usages (#19386) @galipremsagar
  • Remove CUDA 11 Workarounds (#19385) @vuule
  • Further reduce runtime of cuDF classic IO tests (#19382) @mroeschke
  • remove cuspatial references, avoid triggering tests on clang-format config changes (#19380) @jameslamb
  • Add repr to plc.aggregation.Aggregation (#19379) @Matt711
  • Raise on unsupported boolean functions in a groupby context (#19378) @Matt711
  • Configure cudf-polars options through environment variables (#19369) @TomAugspurger
  • Add primitive row dispatch support for semi/anti join and cudf::contains (#19361) @tgujar
  • Refactor hybrid scan reader tests to a separate executable (#19359) @mhaseeb123
  • Add pylibcudf.Column.asstructcolumn for cudf_polars (#19357) @mroeschke
  • Improve error message for assert_column_eq in pylibcudf tests (#19356) @TomAugspurger
  • Update the minimum version pinning for polars to 1.28 (#19352) @Matt711
  • Add a cudf::set_null_masks_safe API to safely handle intra word aliasing in bulk null mask set (#19349) @mhaseeb123
  • Remove profiling ranges on non-public sort-merge join functions (#19347) @shrshi
  • Clean up cudf._lib.strings_udf.pyx (#19335) @mroeschke
  • Add support for pandas-2.3.1 (#19334) @galipremsagar
  • Allow comparison binop to datetime.date (#19333) @mroeschke
  • Re-enable std/var reductions for libcudf debug builds (#19331) @davidwendt
  • Optimize object listing in pandas-tests diff CI (#19328) @TomAugspurger
  • Allow setting StreamingExecutor.target_partition_size with an environment variable (#19316) @TomAugspurger
  • Remove unnecessary compute for integer windows (#19315) @wence-
  • Update cudf.pandas test skips for pandas==2.3.1 (#19313) @TomAugspurger
  • Support Expr.str.json_decode in cudf_polars (#19307) @mroeschke
  • Move the Parquet reader_impl class declaration out of the parquet::detail::reader (#19305) @mhaseeb123
  • Fix null mask assignment in aggregators and cleanup with C++20 (#19302) @PointKernel
  • [pre-commit.ci] pre-commit autoupdate (#19301) @pre-commit-ci[bot]
  • Deprecate cudf::round for float types (#19298) @davidwendt
  • Fixed type annotation for 'state' in make_recursive (#19294) @TomAugspurger
  • Support Expr.str.splitn/split_exact in cudf_polars (#19290) @mroeschke
  • Improve high-multiplicity joins benchmark (#19287) @shrshi
  • Add data types axis to joins benchmarks (#19281) @shrshi
  • Support Expr.str.strip_prefix/suffix in cudf_polars (#19278) @mroeschke
  • Support Expr.str.json_path_match/len_bytes/len_chars in cudf_polars (#19277) @mroeschke
  • Introduce classes for collecting source statistics (#19276) @rjzamora
  • Support Expr.str.find & Expr.str.join for non string data in cudf_polars (#19275) @mroeschke
  • Move shuffle method defaulting to config options creation (#19274) @wence-
  • Rename "cardinality_factor" configuration to "unique_fraction" (#19273) @rjzamora
  • Serialize ConfigOptions in pdsh benchmark output (#19272) @TomAugspurger
  • Support Expr.str.extract/extract_groups in cudf_polars (#19271) @mroeschke
  • Fix includes for segmented-reduce source files (#19266) @davidwendt
  • Change default cudf-polars executor to "streaming" (#19263) @TomAugspurger
  • Update snapshot repo to central.sonatype.com (#19259) @pxLi
  • Raise NotImplementedError for LazyFrame.profile with the streaming executor (#19257) @TomAugspurger
  • Move ast expression function definitions to .cpp files (#19250) @davidwendt
  • Enable chunked reading of PQ sources with >2B rows (#19245) @mhaseeb123
  • Support str.count_matches and str.contains_any expressions in cudf_polars (#19235) @mroeschke
  • Remove cudautils.py (#19233) @mroeschke
  • Use CUDA 12.9 in Conda, Devcontainers, Spark, GHA, etc. (#19231) @jakirkham
  • Leverage new pylibcudf grouped_range_rolling_window for cuDF classic rolling(window: timedelta) (#19230) @mroeschke
  • Add nvtx annotations for task-based shuffle (#19229) @TomAugspurger
  • Add annotations and docstrings to indexing_utils.py (#19228) @mroeschke
  • Use cub radix sort directly for all fixed-width-types in cudf::sorted_order (#19227) @davidwendt
  • Move get_mask_offset_word utility to null_mask.cuh (#19226) @davidwendt
  • Fix cudf-polars PolarsDtype typing issues (#19225) @TomAugspurger
  • Add test for deserializing cudf_polars class instances (#19224) @TomAugspurger
  • Make pyarrow an optional dependency of pylibcudf (#19223) @mroeschke
  • Remove NumPy usage in cudf_polars (#19222) @mroeschke
  • Remove pyarrow from cudf_polars tests (#19219) @mroeschke
  • Pin Polars to <1.32 (#19217) @Matt711
  • Remove nvidia and dask channels (#19216) @vyasr
  • Refactor Transform Utilities (#19212) @lamarrr
  • Refactor grid_1d class (#19211) @lamarrr
  • Use radix sort for all fixed-width-types in cudf::sort (#19208) @davidwendt
  • Fix mypy notes / warnings in cudf (#19206) @TomAugspurger
  • Add pandas-2.3.0 support (#19202) @galipremsagar
  • Avoid pylibcudf.interop.to_arrow in DataFrame.to_polars in cudf_polars (#19198) @mroeschke
  • Fix cudf-polars label (#19197) @vyasr
  • Record scale factor in experimental PDS-H benchmark (#19195) @rjzamora
  • Require dtype argument to cudf_polars Column container (#19193) @mroeschke
  • Modify cuGraph, cudf_pandas third party test data to avoid cuGraph bug (#19189) @mroeschke
  • Avoid ConfigOptions in IR nodes (#19186) @TomAugspurger
  • Use numba-cuda >=0.14.0,<0.15.0 to get pynvjitlink by default. (#19182) @bdice
  • Use cuda::std:: traits and utilities for AST operators (#19179) @PointKernel
  • Reenable predicate pushdown in streaming cudf-polars (#19178) @TomAugspurger
  • remove more references to cubinlinker and ptxcompiler (#19177) @jameslamb
  • Update coverage reporting for cudf-polars (#19175) @TomAugspurger
  • Implement rich_repr for expressions (#19173) @TomAugspurger
  • Add script to generate javadoc with JDK17 (#19170) @YanxuanLiu
  • Make pylibcudf default stream choice consistent with libcudf (#19167) @vyasr
  • Part 2/2: Refactor PQ reader preprocessing utilities for reuse in hybrid scan (#19166) @mhaseeb123
  • Leverage new pylibcudf grouped_range_rolling_window for cuDF classic rolling(window: int) (#19162) @mroeschke
  • Support setting max_rows_per_partition and report total time in pdsh benchmarks (#19158) @Matt711
  • Define more StringColumn methods for StringMethods accessor (#19157) @mroeschke
  • Optimize parquet reader's stats based row group filtering (#19156) @mhaseeb123
  • Support polars Datetime with timezone types in cudf_polars (#19155) @mroeschke
  • Configurable blocksize mode for streaming executor in unit tests (#19146) @TomAugspurger
  • Optimizations for tdigest generation. (#19140) @nvdbaranec
  • Remove CUDA 11 from dependencies.yaml (#19139) @KyleFromNVIDIA
  • Use radix sort for float/double types (#19137) @davidwendt
  • Support radix sort for timestamp and duration types (#19136) @davidwendt
  • Used TypedDict for CachingVisitor.state (#19135) @TomAugspurger
  • Move Accessor implementation to their own directory (#19134) @mroeschke
  • Add benchmarks for sorting float and timestamp (#19133) @davidwendt
  • Enable using page mask in decompress_page_data in Parquet reader (#19132) @mhaseeb123
  • refactor(shellcheck): fix all shellcheck warnings/errors (#19129) @gforsyth
  • Remove pytest pin (#19127) @vyasr
  • Move pdsh utility functions/classes to a separate module (#19126) @Matt711
  • Use pylibcudf.Column.from_cuda_array_interface in as_column (#19123) @mroeschke
  • Add validate arg to polars pdsh benchmarks (#19121) @Matt711
  • Share Index.values with base implementation (#19112) @mroeschke
  • Use len instead of len(obj.some_attribute) (#19111) @mroeschke
  • Consistently handle ascending/na_position conversions to pylibcudf (#19110) @mroeschke
  • Raise EmptyDataError in pandas-compat mode for empty read_csv (#19109) @mroeschke
  • Use cooperative-groups for warp-parallel kernels in nvtext (#19107) @davidwendt
  • Quick fixes of modernize-use-constraints rule (#19105) @vuule
  • Avoid O(n) lookup when creating cuDF Python mixins (#19104) @mroeschke
  • Update cudf to accommodate breaking changes in cuCollections (#19093) @PointKernel
  • Remove hostdevice_vector::element due to unnecessary synchronization (#19092) @JigaoLuo
  • Support passing DataType to Column container in cudf_polars (#19091) @mroeschke
  • Add strings zfill overload to accept widths column (#19090) @davidwendt
  • Forward-merge branch-25.06 to branch-25.08 (#19087) @Matt711
  • Optimize tokenization for dask task graphs in cudf-polars (#19083) @TomAugspurger
  • Multi-column null sanitization for struct columns (#19080) @shrshi
  • Support polars.Expr.value_counts in cudf_polars (#19079) @mroeschke
  • Support polars.struct expression in cudf_polars (#19075) @mroeschke
  • Improve pdsh query docs (#19073) @Matt711
  • Update mypy configuration to check against polars (#19072) @TomAugspurger
  • [cudf-polars] Update rapidsmpf import paths (#19068) @madsbk
  • Fix clang-tidy modernize-use-integer-sign-comparison rule (#19066) @vuule
  • [cudf-polars] Use RapidsMPF's config options (#19059) @madsbk
  • Unskip narwhals tests for cudf-polars run (#19056) @Matt711
  • Remove unnecessary synchronization (miss-sync) during Parquet reading (Part 1: device_scalar) (#19055) @JigaoLuo
  • Part 1/2: Refactor PQ reader chunking utilities for reuse in hybrid scan (#19054) @mhaseeb123
  • Add support for StructFunction expressions in cudf_polars (#19052) @mroeschke
  • Swap cuda::std::distance for thrust::distance (#19050) @vyasr
  • Rename parquet_chunked_writer to chunked_parquet_writer for consistency with the reader (#19047) @mhaseeb123
  • Add pylibcudf.Scalar.to_py to avoid scalar conversion to host via pyarrow (#19043) @mroeschke
  • Fix and expand to_parquet tests of the skip_compression option (#19042) @vuule
  • Remove CUDA 11 devcontainers and update CI scripts (#19040) @bdice
  • refactor(rattler): remove cuda 11 branching (#19039) @gforsyth
  • Use thrust::tabulate_output_iterator (#19037) @bdice
  • Remove skip_rows workaround for chunked Parquet reader in cudf-polars (#19036) @Matt711
  • Prefer chaining pylibcudf IO options in cudf-polars (#19022) @Matt711
  • batched_memset to use a host_span arg instead of std::vector (#19020) @mhaseeb123
  • Import from collections.abc for consistent typing/runtime access (#19019) @mroeschke
  • Avoid using cudf module for type annotations (#19018) @mroeschke
  • Mark pandas unit test test_eval_no_support_column_name as xpassing (#19016) @mroeschke
  • Improving Parquet decode throughput for struct type columns (#19014) @shrshi
  • Unify Frame.split and DataFrame.scatter_by_map/partition_by_hash implementations (#19013) @mroeschke
  • Move IndexedFrame.memory_usage docstrings to DataFrame/Series, make RangeIndex methods consistent with base class (#19010) @mroeschke
  • Share DataFrame/Series.(de)serialize methods, implement to_dlpack directly on Frame (#19008) @mroeschke
  • Pin narwhals to 1.41 (#19007) @Matt711
  • Add year range check to cudf::strings::is_timestamp (#19006) @davidwendt
  • Add cudf::strings::contains_multiple to pylibcudf (#19003) @davidwendt
  • Avoid unnecessary partition step in streaming join (#19002) @rjzamora
  • Part 2/n: Use cooperative groups in PQ decoders (#18978) @mhaseeb123
  • Move libcudf copying benchmarks to nvbench (#18976) @davidwendt
  • Add lag/lead/bitwise/row_number aggregations to pylibcudf (#18975) @mroeschke
  • Switch to importing rather than cimporting datetime (#18974) @vyasr
  • stop uploading packages to downloads.rapids.ai (#18973) @jameslamb
  • Trace IR.do_evaluate in cudf_polars (#18970) @TomAugspurger
  • xfail more pandas unit tests that fail with cudf.pandas before execution instead of xfailing after execution (#18965) @mroeschke
  • Remove test checks that depend on the compression engine (#18960) @vuule
  • Use cooperative-groups for warp-parallel kernels in strings functions (#18959) @davidwendt
  • fetch code before running pull request labeler (#18958) @jameslamb
  • Use cooperative groups in parquet decoder kernels (#18954) @mhaseeb123
  • Add a DataType container in cudf_polars (#18953) @mroeschke
  • add 'rapids-init-pip' to test_cudf_polars_polars_tests.sh (#18951) @jameslamb
  • parameterized ucx / ucxx (#18949) @quasiben
  • Rework cudf::sorted_order implementation for faster compile (#18948) @davidwendt
  • Remove deprecated Series methods, isclose (#18947) @mroeschke
  • Remove deprecated groupby.collect (#18946) @mroeschke
  • Remove deprecated get_dummies(cats=, ...) (#18944) @mroeschke
  • Add .python_typecode and .typestr attributes to DataType (#18941) @Matt711
  • Remove deprecated APIs (#18933) @vuule
  • Remove cudf.Scalar (#18927) @mroeschke
  • Add #pragma once to prevent redundant includes and speed up compilation (#18925) @PointKernel
  • Bump polars version to <1.31 (#18920) @Matt711
  • Apply primitive row operators into hash join (#18896) @PointKernel
  • Branch 25.08 merge branch 25.06 (#18895) @vyasr
  • Remove deprecated cudf::io::host_buffer (#18881) @Matt711
  • Fix decompression scratch size in AUTO mode (#18878) @vuule
  • Apply linter suggestions to cuIO code (#18876) @vuule
  • xfail pandas unit tests that fail with cudf.pandas (#18872) @mroeschke
  • Branch 25.08 merge branch 25.06 (#18855) @vyasr
  • Add support for extended dtypes in cudf.pandas (#18832) @galipremsagar
  • Auto merge fix for branch-25.08 (#18824) @davidwendt
  • Forward-merge branch-25.06 to branch-25.08 (#18817) @Matt711
  • Forward-merge branch-25.06 to branch-25.08 (#18756) @Matt711
  • Fix auto merge conflict for branch-25.08 (#18733) @davidwendt
  • Forward-merge branch-25.06 to branch-25.08 (#18698) @Matt711
  • Fix merge conflict for auto-merger 25.06 to 25.08 (#18693) @davidwendt
  • Fix merge conflict: branch-25.06 into branch-25.08 (#18668) @davidwendt
  • Make cuda12 as JNI default (#18651) @pxLi
  • Forward-merge branch-25.06 into branch-25.08 (#18647) @bdice
  • Fix merge branch-25.06 into branch-25.08 (#18622) @davidwendt
  • Store polars Series instead of pyarrow Array in cudf_polars LiteralColumn expr (#18564) @mroeschke
  • Refactor strings split/record with whitespace logic (#18560) @davidwendt
  • Refactor hash join with multiset (#18021) @PointKernel

- C++
Published by AyodeAwe 7 months ago

https://github.com/rapidsai/cudf - [NIGHTLY] v25.10.00

🔗 Links

🐛 Bug Fixes

  • Fix logic for number of unique values generated by data profile in benchmarks (#19540) @shrshi
  • Fix value counts expression when the column has nulls (#19524) @Matt711
  • Prefer Column.astype over plc.unary.cast in the fill null unary function expression (#19479) @Matt711
  • Fix missing return in StringFunction.Strptime strict=True path (#19464) @Matt711
  • Make dividing a boolean column return f64 dtype in cudf-polars (#19443) @Matt711
  • branch-25.10-merge-branch-25.08 (#19429) @davidwendt

🚀 New Features

  • Make nvCOMP ZLIB (de)compression available by default (#19528) @vuule
  • Add primitive row dispatch support for semi/anti join and cudf::contains (#19518) @PointKernel
  • Derive and use page mask at subpass level for chunked reads (#19515) @mhaseeb123
  • Implement top k expression in cudf-polars using cudf::top_k (#19431) @Matt711
  • [FEA] Add chunked Parquet sink support using the libcudf writer (#19015) @Matt711

🛠️ Improvements

  • Move timeout in cudf.pandas pandas unit tests script to ci script (#19542) @mroeschke
  • Get rid of CG logic in the mixed semi-join kernel (#19536) @PointKernel
  • Construct more cuDF classic Columns with pylibcudf instead of using Buffers (#19535) @mroeschke
  • Fix clang-tools version pinning (#19529) @wence-
  • Add cudf_polars unit test for `is_in([])` expr (#19525) @mroeschke
  • Expose nvtext::letter_type to python (#19520) @Matt711
  • Add missing import of pyarrow.parquet when reading specified row_groups. (#19509) @bdice
  • Don't run serial cudf_pandas tests when testing multiple pandas versions (#19507) @mroeschke
  • Add nvtx ranges and minor fix for lists types in the next-gen parquet reader (#19493) @mhaseeb123
  • Move test_avro/test_api_types.py and some DataFrame tests to new cudf classic test directory structure (#19490) @mroeschke
  • Move test_series.py to new cudf classic test directory structure (#19485) @mroeschke
  • Move test_testing.py to new cudf classic test directory structure (#19481) @mroeschke
  • Allow latest OS in devcontainers (#19480) @bdice
  • Branch 25.10 merge branch 25.08 (#19475) @davidwendt
  • Improve readability when printing pylibcudf enums (#19451) @Matt711
  • Use more pytest fixtures and avoid GPU parameterization in cuDF classic tests (#19450) @mroeschke
  • Update build infra to support new branching strategy (#19445) @robertmaynard
  • Use more pytest fixtures and avoid GPU parameterization in test_indexing/joining/monotonic/multiindex.py (#19437) @mroeschke
  • Use more pytest fixtures and avoid GPU parameterization in cuDF classic tests (#19436) @mroeschke
  • Update s3 Bucket fixture creation in test_s3 (#19424) @mroeschke
  • Use more pytest fixtures and avoid GPU parameterization in cuDF classic tests (#19419) @mroeschke
  • Use GCC 14 in conda builds. (#19192) @vyasr

- C++
Published by rapids-bot[bot] 7 months ago

https://github.com/rapidsai/cudf - v25.06.00

🚨 Breaking Changes

  • Remove cudf.BaseIndex (#18751) @mroeschke
  • Implement BIT_COUNT unary operation (#18589) @ttnghia
  • Expose column chunk metadata in read_parquet_metadata() (#18579) @mhaseeb123
  • Fix overflow for MERGE_M2 groupby aggregation (#18546) @ttnghia
  • Deduplicate parquet physical type enums (#18526) @mhaseeb123
  • Implemented String Output & User-data Support for Transforms (#18490) @lamarrr
  • Promote Parquet type enums to enum classes (#18441) @mhaseeb123
  • Move parquet schema types and structs to public headers (#18424) @mhaseeb123
  • Start removal of vector factories with _sync suffix by deprecating them and adding versions without the suffix (#18414) @vuule
  • Skip decoding of pages marked as pruned in PQ reader (#18347) @mhaseeb123
  • Deprecate nvtext subword tokenizer (#18334) @davidwendt
  • Add standard data ingestion pipelines to pylibcudf for ndarrays (#18311) @Matt711
  • Remove extraneous modules from top level cudf namespace (#18287) @mroeschke
  • Add Keep Option Parameter to Distinct (#18237) @warrickhe
  • Update to CCCL 2.8.x with no CCCL patches (#18235) @bdice

🐛 Bug Fixes

  • Disable pytest benchmark for Narwhals CI job (#19074) @Matt711
  • Avoid undefined behaviour in rolling_store_output_functor (#19069) @wence-
  • Filter out pkg_resources UserWarning to make nightly CI pass (#19058) @Matt711
  • Pin deltalake to <1.0.0 (#19017) @Matt711
  • [BUG] Incorrectly getting the caller's frame when searching for locals and globals in cudf.pandas (#18979) @Matt711
  • Ensure gc fixture is used in custreamz test (#18915) @TomAugspurger
  • Fix a potential segfault in PQ reader's number of rows per source calculation (#18906) @mhaseeb123
  • Fix Dataframe getitem when MultiIndex columns exist (#18880) @galipremsagar
  • Ensure eq/ne between Columns in public objects don't return bool (#18875) @mroeschke
  • Fix fencepost error in Repartition task generation (#18854) @wence-
  • Fix cudf_polars pl.col(...).len() always excluding null values (#18849) @mroeschke
  • Throw a descriptive exception in Parquet reader when trying to read files with more than two billion rows (#18835) @mhaseeb123
  • Skip a decompression test (#18825) @vuule
  • Update strings benchmarks to use alloc_size column/table function (#18822) @davidwendt
  • Fix host decompression of empty DEFLATE data (#18805) @vuule
  • Avoid going OOM in test_row_limit_exceed_raises by using dummy array (#18802) @Matt711
  • Fix host decompression of empty Snappy data (#18800) @vuule
  • Skip test that fails due to polars issue (#18787) @wence-
  • Ensure scalar dtype is always set in from_py (#18780) @vyasr
  • Fix reading of Snappy compressed Avro files (#18774) @vuule
  • Fix missing semicolon in label_bins.cu (#18765) @evanramos-nvidia
  • Fix noexcept annotations on strings_column_view (#18763) @wence-
  • Fix integer overflows in pylibcudf from_column_view_of_arbitrary (#18758) @wence-
  • Fix overflow case and clean up some logic (#18734) @vyasr
  • Link to nvtx3::nvtx3-cpp instead of nvToolsExt (#18730) @jakirkham
  • Revise DaskIntegration protocol to align with rapidsmpf (#18720) @rjzamora
  • Fix skip_compression option in the Parquet writer with host compression (#18714) @vuule
  • Add missing header (#18671) @vyasr
  • Revert "Set flag to always use unsafe atomic storage" (#18657) @PointKernel
  • Fix optional operator* called on a disengaged value in clamp.cu (#18655) @davidwendt
  • Add missing header to host_memory.cpp (#18649) @alliepiper
  • Fix device compression when writing Parquet files without using nvCOMP (#18644) @vuule
  • Add CUDA_ARCHITECTURES setting to cpp-linters script (#18637) @davidwendt
  • Pin to cython<3.1 (#18617) @wence-
  • Fix DataFrame.memory_usage output order (#18595) @mroeschke
  • Set flag to always use unsafe atomic storage (#18590) @PointKernel
  • Update KvikIO S3 endpoint usage (#18565) @kingcrimsontianyu
  • Skip cuml third-party integration tests that may segfault (#18561) @Matt711
  • Allow .iloc with cuDF objects as column indexers (#18558) @mroeschke
  • Fix overflow for MERGE_M2 groupby aggregation (#18546) @ttnghia
  • Add back cudf root (#18544) @vyasr
  • Change default memory resource for 'distributed' cudf-polars (#18531) @rjzamora
  • Fix copy-on-write buffer separation and cleanup (#18530) @galipremsagar
  • Fix cpp examples cmake to use the rapids_config.cmake (#18501) @davidwendt
  • Rename rapidsmp to rapidsmpf (#18493) @rjzamora
  • Fix compilation with the C++20 standard (#18486) @vuule
  • Fix an error when reading some compressed Parquet V2 files (#18478) @vuule
  • Support title-case characters in strings capitalize() and title() APIs (#18457) @davidwendt
  • Ensure DataFrame column label operations reset label_dtype (#18452) @mroeschke
  • Fix a segfault when reading a Parquet file with unsupported compression type (#18451) @vuule
  • Fix logger macros (#18444) @vyasr
  • Fix auto-detection of compression type in host-side decompression (#18440) @shrshi
  • Use delete not free to release data allocated with new (#18412) @wence-
  • Fix synchronization issues in host compression and decompression (#18395) @vuule
  • Update Dask array-conversion handling (#18382) @rjzamora
  • Fixed indexing on empty DataFrame with no columns (#18381) @TomAugspurger
  • Deterministic hashing for DataFrameScan nodes in cudf-polars multi-partition executor (#18351) @TomAugspurger
  • Fix index of right table in unary operators in AST, in Joins (#18333) @karthikeyann
  • Add offsetalator to contiguous-split (#18312) @davidwendt
  • Support large strings in nvtext vocabulary-tokenizer (#18283) @davidwendt
  • Handle empty aggregations in multi-partition cudf.polars group_by (#18277) @TomAugspurger

📖 Documentation

  • Docs for streaming executor options (#18934) @quasiben
  • Fix some duplicate toctree issues and improve groupby docs (#18580) @vyasr
  • [DOC] Running libcudf benchmarks and comparing output results (#18548) @Matt711
  • Fix doxygen usage of the contraction for it is (#18517) @davidwendt
  • Clarify @brief tag as description/title on documentation guide (#18515) @davidwendt
  • [DOC] Improve clarity in parquet APIs set_row_groups and set_columns parquet (#18466) @Matt711
  • Add a usage page to cudf-polars documentation (#18460) @Matt711
  • [DOC] Fix typo in CONTRIBUTING.md on build type tests (#18456) @JigaoLuo
  • improve docs related to documentation contribution (#18418) @ncclementi
  • Add restart kernel note in cudf pandas docs (#18374) @ncclementi

🚀 New Features

  • Add CLI argument to enable RMM async memory resource in PDS-H (#18899) @pentschev
  • Scan a headerless CSV file with column names provided (#18816) @Matt711
  • Add fast paths for DataFrame.to_cupy (#18801) @Matt711
  • Require numba-cuda>=0.11.0 (#18770) @brandon-b-miller
  • Create a pylibcudf Column from a python iterable (#18768) @Matt711
  • Support ConditionalJoin via broadcasting in cudf-polars streaming engine (#18723) @rjzamora
  • Experimental PQ reader utility to calculate total rows in input row groups (#18716) @mhaseeb123
  • Extend explain_query to support printing the logical plan (pre lowered plan) (#18708) @Matt711
  • Reuse libcudf dependencies for Java JNI build when they are available (#18682) @ttnghia
  • Add alloc_size member function to cudf::column and cudf::table (#18639) @davidwendt
  • Print the physical cudf-polars plan in pdsh.py (#18635) @rjzamora
  • String Transform Examples (#18616) @lamarrr
  • Add streaming support for group_by -> n_unique to cudf-polars (#18606) @rjzamora
  • Export cudf compiler flags and definitions (#18604) @ttnghia
  • Implement BIT_COUNT unary operation (#18589) @ttnghia
  • Expose column chunk metadata in read_parquet_metadata() (#18579) @mhaseeb123
  • Add APIs to check ORC and Parquet compression support at runtime (#18578) @vuule
  • Add Distinct support to the cudf-polars streaming executor (#18576) @rjzamora
  • Add support for large list host Arrow data conversion (#18562) @vyasr
  • Implement BITWISE_AGG aggregations (bitwise AND, OR and XOR) for sort-based groupby and reduction (#18551) @ttnghia
  • Implement row group pruning with bloom filters in experimental PQ reader (#18545) @mhaseeb123
  • Implement row group pruning with stats in experimental PQ reader (#18543) @mhaseeb123
  • [JNI] Expose row-wise sha1 api (#18540) @warrickhe
  • Add Sort + head/tail support to streaming cudf-polars executor (#18538) @rjzamora
  • Add multi-partition MapFunction support to cudf-polars (#18523) @rjzamora
  • Adds support for writing raw UTF-8 characters (without escaping) in the JSON writer (#18508) @Matt711
  • Support reading from device buffers in the pylibcudf IO APIs (#18496) @Matt711
  • Support multi-partition Select operations with aggregations (#18492) @rjzamora
  • Implemented String Output & User-data Support for Transforms (#18490) @lamarrr
  • Add a utility to bulk set multiple null masks (#18489) @mhaseeb123
  • High level interface for experimental PQ reader and implementation of metadata APIs (#18480) @mhaseeb123
  • Added pylibcudf.utilities.is_ptds_enabled (#18467) @TomAugspurger
  • Add a public API for copying a table_view to device array (#18450) @Matt711
  • Support cudf-polars cast_time_unit (#18442) @brandon-b-miller
  • Support creating a pylibcudf Column from a host array (#18425) @Matt711
  • Move parquet schema types and structs to public headers (#18424) @mhaseeb123
  • Add optional dtype argument to Scalar.from_any (#18415) @Matt711
  • Expose cudf::chunked_pack in pylibcudf (#18411) @wence-
  • Add support for long string columns in cudf::contiguous_split (#18393) @nvdbaranec
  • Implemented String Input support for Transforms and Removed jit::column_device_view (#18378) @lamarrr
  • Automatically dispatch between host and device decompression/compression based on the number of buffers (#18363) @vuule
  • Expose join hash table load factor (#18361) @PointKernel
  • Skip decoding of pages marked as pruned in PQ reader (#18347) @mhaseeb123
  • Sort-based inner join for high-multiplicity tables (#18318) @shrshi
  • Support constructing pylibcudf Columns and Tables from views into arbitrary objects (#18314) @vyasr
  • Add standard data ingestion pipelines to pylibcudf for ndarrays (#18311) @Matt711
  • Support cudf-polars isoyear and week (isoweek) (#18265) @brandon-b-miller
  • Add Keep Option Parameter to Distinct (#18237) @warrickhe
  • Add rapidsmp shuffle support to cudf-polars (#18231) @rjzamora
  • Support cudf-polars strftime (#18181) @brandon-b-miller
  • Add benchmark for join operations with low build table cardinality (#18105) @shrshi
  • Add nvtext substring deduplication APIs (Part 2) (#18104) @davidwendt
  • Support include_file_paths in cudf polars (#18057) @Matt711
  • Add support for the Arrow device capsule interfaces (#15370) @vyasr

🛠️ Improvements

  • use 'rapids-init-pip' in wheel CI, other CI changes (#18902) @jameslamb
  • Avoid RecursionError in custreamz test (#18887) @TomAugspurger
  • Update NumPy dependency in cudf.pandas-catboost integration test (#18870) @Matt711
  • CPU only execution for PDSH (#18869) @quasiben
  • Remove more top level cudf imports in core (#18862) @mroeschke
  • Remove top level cudf imports in core (#18857) @mroeschke
  • Add CUDF_INSTALL_DIR for JAVA build script (#18852) @pxLi
  • Call the correct from_pandas in hdf reader (#18850) @galipremsagar
  • Update __all__ in cudf_polars/dsl/ir.py (#18848) @Matt711
  • Upload examples conda package (#18847) @vyasr
  • Add retries to prevent failures in occasionally slow CI runs (#18843) @galipremsagar
  • Finish CUDA 12.9 migration and use branch-25.06 workflows (#18839) @bdice
  • Remove toplevel import cudf from window/tools/join directories (#18833) @mroeschke
  • Remove toplevel import cudf from cudf/io files (#18829) @mroeschke
  • Update pdsh benchmark script to support explain-only (#18826) @TomAugspurger
  • Refactor UDF utils and add a hook to enable NRT when necessary (#18823) @brandon-b-miller
  • Fix memory access error in nvtext::edit_distance (#18821) @davidwendt
  • Update to clang 20 (#18818) @bdice
  • Reduce more data sizes of Python tests (#18814) @mroeschke
  • Mark DataFrame.dtypes as an _external_only_api (#18809) @mroeschke
  • Change calls to thrust::swap to cuda::std::swap (#18808) @davidwendt
  • Move implemented BaseIndex methods over to Index (#18807) @mroeschke
  • Improve pandas version fetching script (#18793) @galipremsagar
  • Change cudf::sort googlebench benchmarks to nvbench (#18786) @davidwendt
  • Only warn in cudf.pandas if rmm mode explicitly set and rmm already configured (#18785) @jcrist
  • Quote head_rev in conda recipes (#18784) @bdice
  • Move RangeIndex implementation below Index (#18777) @mroeschke
  • Remove unnecessary _Ravelled class (#18771) @Matt711
  • Remove pytest-rerunfailures (#18766) @mroeschke
  • Replace from_arrow with direct calls Column/Table constructors in pylibcudf and cudf-polars tests (#18762) @Matt711
  • CUDA 12.9 use updated compression flags (#18755) @robertmaynard
  • fix(rattler): add librmm to host for libcudf to fix overlinking error (#18754) @gforsyth
  • Remove the file name from the output in cudf-polars' explain APIs (#18752) @Matt711
  • Remove cudf.BaseIndex (#18751) @mroeschke
  • Support creating a pylibcudf Column from a general ndarray (#18744) @Matt711
  • Improve lowering of Distinct IR nodes for high-cardinality data (#18725) @rjzamora
  • Simplify Numba-CUDA MVC logic (#18724) @bdice
  • Test with CUDA 12.9.0 (#18721) @bdice
  • Add more cudf.Series microbenchmarks (#18718) @Matt711
  • Run unit-tests-cudf-pandas on branch-25.06 for nightly tests (#18717) @davidwendt
  • Move test_large_unique_categories_repr to benchmarks (#18715) @galipremsagar
  • Allow pylibcudf.Column to consume objects exposing __arrow_c_stream__ (#18712) @mroeschke
  • Switch from printing to logging (#18711) @vyasr
  • Add Python tests for different compression implementations (#18710) @vuule
  • Remove redundant xfails in cuml integration tests (#18699) @Matt711
  • ci: run unit-tests-cudf-pandas on branch-25.06 workflow (#18692) @gforsyth
  • Exclude librmm.so from auditwheel (#18691) @bdice
  • Add C++ tests for different compression implementations (#18690) @vuule
  • Improve runtime of cuDF Python unit tests (#18689) @mroeschke
  • Require at least numba-cuda 0.10.1 (#18688) @brandon-b-miller
  • Add nvidia-cuda-{nvrtc, nvcc} as a dependency for cuDF wheels (#18686) @brandon-b-miller
  • Support rolling aggregations in in-memory cudf-polars execution (#18681) @wence-
  • Replace parquet_blocksize with target_partition_size (#18669) @rjzamora
  • Skip test_large_unique_categories_repr in CI (#18666) @bdice
  • Locally import pyarrow.dataset and fsspec for import cudf performance (#18663) @mroeschke
  • Disable arm64 python tests (#18662) @galipremsagar
  • Pin numba-cuda>=0.9.0,!=0.10.0 due to CI hangs on ARM (#18661) @mroeschke
  • Fix compile warnings in Java JNI (#18660) @ttnghia
  • Drop Empty nodes from IR graph (#18658) @rjzamora
  • Add support for Python 3.13 (#18648) @gforsyth
  • Cleanup libcudf detail/aggregation.hpp/.cuh (#18642) @davidwendt
  • Skip all known pytest failures in pandas-tests (#18641) @galipremsagar
  • Preserve partitioning after Filter and Projection in cudf-polars (#18638) @rjzamora
  • Support quantile in cudf-polars grouped aggregations (#18634) @wence-
  • Deprecate Series.nullmask, Series.nullable, Series.from_categorical, Series.from_masked_array, cudf.isclose (#18631) @mroeschke
  • Access private objects by importing from module instead of cudf.core/util namespace (#18629) @mroeschke
  • Replace unnecessary cudf::size_of() calls with sizeof() (#18628) @davidwendt
  • Improve cold cache dropping (#18626) @kingcrimsontianyu
  • Improve default config values for cudf-polars streaming (#18623) @rjzamora
  • Add gtest error check for nvtext::wordpiece_tokenize (#18621) @davidwendt
  • Polars dataframe serialize using chunked pack (#18614) @madsbk
  • xfail all known errors in pandas-test suite (#18612) @galipremsagar
  • Add TemporalBaseColumn as a parent class to DatetimeColumn and TimedeltaColumn (#18611) @mroeschke
  • Update cudf::cast internal function to use sizeof instead of cudf::size_of (#18607) @davidwendt
  • Move cudf/utils/utils.py methods to appropriate locations (#18605) @mroeschke
  • pylibcudf.Column: add device_buffer_size and register a dask.sizeof function for cudf-polars Column and DataFrame (#18602) @madsbk
  • Use cached_property for Datetime and Timedelta column properties (#18601) @mroeschke
  • Annotate and simplify from_arrow (#18600) @mroeschke
  • Enable reporting peak memory usage for gtests (#18599) @davidwendt
  • Prune methods from Frame that are specific to subclasses (#18597) @mroeschke
  • Switch tensorflow integration tests to use 12.x (#18596) @galipremsagar
  • refactor: use libnvcomp from libkvikio wheel to unblock Python 3.13 upgrade (#18593) @gforsyth
  • Add temporary pdsh benchmarks to cudf_polars.experimental (#18592) @rjzamora
  • Update numba-cuda dependency to >=0.9.0 (#18591) @brandon-b-miller
  • use 'certifi' certificates in fetch_pandas_versions script (#18588) @jameslamb
  • Add nvtext substring duplication APIs (Part 1) (#18585) @davidwendt
  • Bump polars version to <1.29 (#18581) @Matt711
  • Allow datetime.timedelta objects in pylibcudf.Scalar.from_py (#18577) @mroeschke
  • Rework strings split_helper utility for better reuse (#18575) @davidwendt
  • Additional tests for strings split APIs (#18574) @davidwendt
  • Support datetime.datetime objects in pylibcudf.Scalar.from_py (#18572) @mroeschke
  • Store Python scalars instead of PyArrow Scalars in cudf_polars Literal expr (#18563) @mroeschke
  • Support plc.Scalar.from_py(None) and plc.Scalar.from_py(int, float type) (#18559) @mroeschke
  • Add xfail window function tests for cudf_polars (#18557) @btepera
  • Add fast paths to Series.to_cupy and Series.values (#18555) @Matt711
  • Reduce cudf-polars pyarrow usage (#18554) @vyasr
  • Avoid possible invalid kernel grid error in cudf::set_null_masks if no bitmasks to set (#18553) @mhaseeb123
  • Adjust cudf Python groupby test for cuCollections update (#18550) @mroeschke
  • Refactor scan test I/O logic into shared make_partitioned_source helper (#18542) @Matt711
  • Download build artifacts from Github for CI jobs (#18539) @VenkateshJaya
  • Update hypothesis version (#18537) @galipremsagar
  • Make Python testing dependencies more specific to pylibcudf vs cudf (#18535) @mroeschke
  • Pin hypothesis<6.131.1 due to performance issues (#18532) @mroeschke
  • Deduplicate parquet physical type enums (#18526) @mhaseeb123
  • Reduce the number of miscellaneous pandas unit tests run with cudf.pandas (#18524) @mroeschke
  • Improve nvtext::tokenize_with_vocabulary performance (#18522) @davidwendt
  • Make pylibcudf.Column.from_rmm_buffer a Python staticmethod (#18521) @mroeschke
  • Add more short circuit checks for .equals (#18520) @mroeschke
  • Add synchronous task scheduler to cudf-polars (#18519) @rjzamora
  • Don't fetch dlpack headers when building cuDF Python (#18518) @mroeschke
  • Refactor polars configuration (#18516) @TomAugspurger
  • Refactor internal strings utility to separate header and definition file (#18514) @davidwendt
  • Fix print() keyword argument in cudf pandas test (#18513) @trxcllnt
  • Improve performance of strings split-record on whitespace (#18510) @davidwendt
  • Use cuda::std::iter_value_t instead of thrust iterator traits (#18509) @miscco
  • Remove redundant task-graph logic for streaming GroupBy (#18507) @rjzamora
  • Replace GPU_ARCHS build variable by CMAKE_CUDA_ARCHITECTURES (#18506) @ttnghia
  • Optimize pandas metadata generation to reduce memory pressure (#18505) @galipremsagar
  • Replace deprecated host_buffer in favor of host_span in SourceInfo (#18503) @Matt711
  • Add pylibcudf.Column.from_rmm_buffer (#18502) @mroeschke
  • Replace thrust functors with libcu++ ones (#18500) @miscco
  • Rename cudf-polars executors (#18499) @rjzamora
  • Remove casting functions in pylibcudf utils (#18497) @Matt711
  • Increase wheel size limit. (#18487) @bdice
  • Add CategoricalIndex.from_codes (#18485) @mroeschke
  • Split join header (#18484) @shrshi
  • Fix unspecified behavior involving move semantics and order of evaluation (#18481) @kingcrimsontianyu
  • Remove need for to_cudf_compatible_scalar (#18477) @mroeschke
  • Rerun flaky pytests in CI (#18476) @galipremsagar
  • Vendor RAPIDS.cmake (#18473) @bdice
  • Add ARM conda environments. (#18470) @bdice
  • Bump polars version to <1.28 (#18469) @Matt711
  • Add sink support in cudf_polars (#18468) @mroeschke
  • Enable rapidsmpf spilling in cudf-polars (#18461) @madsbk
  • Promote Parquet type enums to enum classes (#18441) @mhaseeb123
  • Consolidate logic in DataFrame.__init__ for listlike arguments (#18439) @mroeschke
  • Update compression formats supported in JSON reader (#18438) @shrshi
  • Disabled Jitify Minification (#18436) @lamarrr
  • Fix printing decimal128 types that are zero (#18435) @trxcllnt
  • Replace direct use of nvCOMP and of its adapter with the higher-level decompression API (#18434) @vuule
  • Add more cudf.DataFrame constructor pytest benchmarks (#18433) @mroeschke
  • Test against stable tags for narwhals (#18431) @Matt711
  • Refcount-based dropping of cached evaluations in cudf-polars executor (#18430) @wence-
  • Replace Thrust iterator facilities with libcu++ ones (#18427) @miscco
  • Remove numpy requirement when converting 2d cuda array interface objects to pylibcudf Columns (#18426) @Matt711
  • Share more cudf.Column methods for indices_of/isin (#18423) @mroeschke
  • Switch the ptr type in gpumemoryview from Py_ssize_t to uintptr_t (#18419) @Matt711
  • Add strings::extract_single API (#18417) @davidwendt
  • Add to_arrow_host_stringview interop API (#18416) @davidwendt
  • Start removal of vector factories with _sync suffix by deprecating them and adding versions without the suffix (#18414) @vuule
  • Allow polars arrow conversion to produce string_view (#18413) @wence-
  • Change dask_cudf.to_parquet behavior for local filesystems (#18408) @rjzamora
  • Add rank and label_bin methods to ColumnBase (#18407) @mroeschke
  • Improve performance of strings::like for long strings (#18406) @davidwendt
  • Automatic single-partition fallback in cudf-polars (#18405) @rjzamora
  • Remove _sync suffix from hostdevice types (#18404) @vuule
  • Use owning Arrow types in C++ to expose data to Python (#18402) @vyasr
  • add static push and pop methods to NvtxRange (#18401) @zpuller
  • Deprecate cudf.Scalar (#18394) @mroeschke
  • Bump polars version to <1.27 (#18387) @Matt711
  • Branch 25.06 merge 25.04 (#18380) @Matt711
  • Silence warning by setting BUILD_SHARED_LIBS (#18371) @vyasr
  • Rewrite groupby aggregations in cudf-polars to simplify evaluation (#18369) @wence-
  • Pass stream through when taking ownership from libcudf (#18367) @wence-
  • Expose new grouped_range_rolling API in pylibcudf (#18365) @wence-
  • Avoid patching sort algorithms from CCCL (#18364) @miscco
  • Deprecate old nvtext::normalize_characters (#18360) @davidwendt
  • refactor(rattler): enable strict channel priority for builds (#18358) @gforsyth
  • Optimize sequences by introducing make_offsets_child_column (#18357) @ustcfy
  • Decompress all data in a single decompress_page_data when reading Parquet input in a single chunk (#18352) @vuule
  • Moving wheel builds to specified location and uploading build artifacts to Github (#18346) @VenkateshJaya
  • Performance improvement for tolower/toupper for multi-byte UTF-8 characters (#18345) @davidwendt
  • Branch 25.06 merge branch 25.04 (#18344) @vyasr
  • Use dask-cuda for cudf-polars experimental testing (#18343) @rjzamora
  • Deprecate nvtext subword tokenizer (#18334) @davidwendt
  • Remove cudf.Scalar in as_column (#18331) @mroeschke
  • Add tests for cudf.polars to be able to work on a cpu-only machine (#18327) @galipremsagar
  • Allow cudf.DataFrame.from_pylibcudf to accept a pylibcudf.io.TableWithMetadata (#18319) @mroeschke
  • Avoid stateful construction in DataFrame.__init__ (#18306) @mroeschke
  • Improve the groupby performance for extremely low cardinality (#18290) @PointKernel
  • Remove extraneous modules from top level cudf namespace (#18287) @mroeschke
  • Require type annotations in cudf.polars (#18285) @TomAugspurger
  • Removing unnecessary StreamSynchronization in reading (#18279) @JigaoLuo
  • Update to CCCL 2.8.x with no CCCL patches (#18235) @bdice
  • Reduce register pressure for compute_column_kernel (#18226) @matal-nvidia
  • Use the mapped buffer for all read operations in the memory-mapped source; switch default source to the kvikIO one (#18204) @vuule
  • Improve test coverage in the catboost integration tests (#18126) @Matt711
  • Create file sources in parallel (#18094) @vuule
  • Enable stumpy_distributed tests (#17969) @galipremsagar
  • Refactor distinct join to use primitive row operators when proper (#17726) @PointKernel
  • Update chunked parquet reader benchmarks (#16543) @sdrp713

- C++
Published by raydouglass 9 months ago

https://github.com/rapidsai/cudf - [NIGHTLY] v25.08.00

🔗 Links

🚨 Breaking Changes

  • Remove deprecated Series methods, isclose (#18947) @mroeschke
  • Remove deprecated groupby.collect (#18946) @mroeschke
  • Remove deprecated get_dummies(cats=, ...) (#18944) @mroeschke
  • Remove cudf.Scalar (#18927) @mroeschke
  • Remove deprecated cudf::io::host_buffer (#18881) @Matt711

🐛 Bug Fixes

  • Fix flaky custreamz test (#18961) @TomAugspurger

📖 Documentation

  • Update cuDF Python library design with BaseIndex and pylibcudf updates (#18903) @mroeschke

🚀 New Features

  • Add CLI argument to enable OOM protection in PDS-H (#18914) @pentschev

🛠️ Improvements

  • add 'rapids-init-pip' to test_cudf_polars_polars_tests.sh (#18951) @jameslamb
  • parameterized ucx / ucxx (#18949) @quasiben
  • Remove deprecated Series methods, isclose (#18947) @mroeschke
  • Remove deprecated groupby.collect (#18946) @mroeschke
  • Remove deprecated get_dummies(cats=, ...) (#18944) @mroeschke
  • Add .python_typecode and .typestr attributes to DataType (#18941) @Matt711
  • Remove cudf.Scalar (#18927) @mroeschke
  • Add #pragma once to prevent redundant includes and speed up compilation (#18925) @PointKernel
  • Branch 25.08 merge branch 25.06 (#18895) @vyasr
  • Remove deprecated cudf::io::host_buffer (#18881) @Matt711
  • Apply linter suggestions to cuIO code (#18876) @vuule
  • xfail pandas unit tests that fail with cudf.pandas (#18872) @mroeschke
  • Branch 25.08 merge branch 25.06 (#18855) @vyasr
  • Auto merge fix for branch-25.08 (#18824) @davidwendt
  • Forward-merge branch-25.06 to branch-25.08 (#18817) @Matt711
  • Forward-merge branch-25.06 to branch-25.08 (#18756) @Matt711
  • Fix auto merge conflict for branch-25.08 (#18733) @davidwendt
  • Forward-merge branch-25.06 to branch-25.08 (#18698) @Matt711
  • Fix merge conflict for auto-merger 25.06 to 25.08 (#18693) @davidwendt
  • Fix merge conflict: branch-25.06 into branch-25.08 (#18668) @davidwendt
  • Make cuda12 as JNI default (#18651) @pxLi
  • Forward-merge branch-25.06 into branch-25.08 (#18647) @bdice
  • Fix merge branch-25.06 into branch-25.08 (#18622) @davidwendt

- C++
Published by rapids-bot[bot] 9 months ago

https://github.com/rapidsai/cudf - v25.04.00

🚨 Breaking Changes

  • Remove unused group_range_rolling_window API (#18313) @wence-
  • [BUG] Disabled JIT for CUDA Runtime < 11.5 (#18296) @lamarrr
  • Remove cudf.Scalar from binops (#18240) @mroeschke
  • Enforce deprecation of dtype parameter in sum/product (#18070) @mroeschke
  • Remove deprecated single component datetime extract APIs (#18010) @Matt711
  • Remove deprecated rolling window functionality (#17993) @wence-
  • Remove deprecated nvtext::minhash_permuted APIs (#17939) @davidwendt
  • Remove dataframe protocol (#17909) @vyasr
  • Use new rapids-logger library (#17899) @vyasr
  • Added Multi-input & Scalar Support for Transform UDFs (#17881) @lamarrr
  • Fixed incorrect PTX parsing of ret instruction after branch label (#17859) @lamarrr
  • Use KvikIO to enable file's fast host read and host write (#17764) @kingcrimsontianyu

🐛 Bug Fixes

  • Fix alpha versions of cudf package. (#18429) @bdice
  • Backport: Deterministic hashing for DataFrameScan nodes in cudf-polars multi-partition executor (#18351) (#18420) @bdice
  • Skip failing Narwhals rolling groupby tests (#18398) @Matt711
  • Pin cmake in test_java to be less than 4.0.0 (#18392) @abellina
  • Skip polars tests that fail with pydantic deprecation warnings (#18388) @Matt711
  • Backport: Fix index of right table in unary operators in AST, in Joins (#18342) @bdice
  • xfail narwhals sqlframe tests (#18297) @Matt711
  • [BUG] Disabled JIT for CUDA Runtime < 11.5 (#18296) @lamarrr
  • Make a pylibcudf Column from a device array object with strides=None (#18295) @Matt711
  • Fix cudf.pandas objects to not be Callable (#18288) @galipremsagar
  • Skip failing polars test test_general_prefiltering (#18264) @Matt711
  • Filter all cudf.pandas profiler tests from running in parallel (#18262) @Matt711
  • Allow cudf.Series([pd.NA], dtype=, nan_as_null=False) (#18259) @mroeschke
  • Fix cross join with extra columns (#18256) @galipremsagar
  • Fix Dataframe.loc to not modify the actual dataframe (#18254) @galipremsagar
  • Remove RMM macro usage from to_arrow_device.cu (#18252) @davidwendt
  • Skip Narwhals cross join tests for cudf.pandas CI run (#18249) @Matt711
  • Fix cudf-polars tests for polars < 1.24 (#18246) @wence-
  • Fix experimental cudf-polars tests (#18244) @rjzamora
  • Fix datetime64 vs datetime binops max resolution (#18241) @galipremsagar
  • Use CCCL::libcudacxx include directories in Jitify preprocessing. (#18233) @bdice
  • Disable conda prefix patching to avoid mangling binaries (#18225) @vyasr
  • Workaround for ARM compiler issue with single space literal string (#18220) @davidwendt
  • Bump nightly check limit (#18213) @Matt711
  • Support comparative binops between categorical and non categorical (#18200) @mroeschke
  • Make the version file inside cudf.pandas not a symlink (#18198) @vyasr
  • Ensure RAPIDS_ARTIFACTS_DIR is set for build metrics reports. (#18192) @bdice
  • Ignore run exports of libcufile. (#18190) @bdice
  • Skip flaky multi GPU test (#18187) @Matt711
  • Fix BPE merges table static-map capacity size (#18184) @davidwendt
  • Drop CUB_QUOTIENT_CEILING (#18179) @miscco
  • Disable ARM CI in C++ and Python test CI jobs (#18175) @Matt711
  • Add fmt to the test/benchmarks env (#18173) @vyasr
  • Fix merge(how=left, left_on=, right_index=True, sort=True) (#18166) @mroeschke
  • Allow nonnative cupy dtype in cudf.Series (#18164) @mroeschke
  • Fix Series construction from numpy array with non-native byte order (#18151) @mroeschke
  • Use protocol for dlpack instead of deprecated function in cupy notebook (#18147) @Matt711
  • Skip failing test (#18146) @vyasr
  • Update calls to KvikIO's config setter (#18144) @kingcrimsontianyu
  • Reduce memory use when writing tables with very short columns to ORC (#18136) @vuule
  • Handle empty dictionary in to_arrow_device interop (#18121) @davidwendt
  • Allow pivot_table to accept single label index and column arguments (#18115) @mroeschke
  • Preserve DataFrame.column subclass and type during binop (#18113) @mroeschke
  • Fix rmm macro call (#18108) @pmattione-nvidia
  • Add include for <functional> (#18102) @miscco
  • Remove static column vectors from window function tests. (#18099) @mythrocks
  • Fix scatter_by_map with spilling enabled (#18095) @mroeschke
  • Use the right version macro CCCL_MAJOR_VERSION (#18073) @miscco
  • Fix test_scan_csv_multi cudf-polars test (#18064) @rjzamora
  • Fix memcopy direction for concatenate (#18058) @tgujar
  • Fix upstream dask loc test (#18045) @rjzamora
  • Fix hang on invalid UTF-8 data in string_view iterator (#18039) @davidwendt
  • Fix dask_cudf.to_orc deprecation (#18038) @rjzamora
  • Compatibility with dask.dataframe's is_scalar (#18030) @TomAugspurger
  • Fix the build error due to KvikIO update (#18025) @kingcrimsontianyu
  • Fix failing ibis test (#18022) @Matt711
  • Skip failing polars tests (#18015) @Matt711
  • Fix to_arrow to return consistent pandas-metadata (#18009) @galipremsagar
  • Prevent setting custom attributes to ColumnMethods (#18005) @galipremsagar
  • Compatibility with Dask main (#17992) @TomAugspurger
  • [Bug] Fix Parquet-metadata sampling in cudf-polars (#17991) @rjzamora
  • Add missing include for calling std::iota() (#17983) @davidwendt
  • Fix pickle and unpickling for all objects (#17980) @galipremsagar
  • Install duckdb, the default backend for ibis, in the cudf.pandas integration tests (#17972) @Matt711
  • Check null count too in sum aggregation (#17964) @Matt711
  • Raise NotImplementedError for groupby.agg if duplicate columns would be created (#17956) @mroeschke
  • Ensure disabling the module accelerator is thread-safe (#17955) @vyasr
  • Fix DataFrame/Series.rank for int and null data in mode.pandas_compatible (#17954) @mroeschke
  • Limit buffer size in reallocation policy in JSON reader (#17940) @shrshi
  • Make cudf.pandas proxy array picklable (#17929) @Matt711
  • Add missing standard includes (#17928) @miscco
  • Fix torch integration test (#17923) @Matt711
  • Fix to_pandas writable bug for datetime and timedelta types (#17913) @galipremsagar
  • Raise NotImplementedError if .merge(suffixes=) introduces duplicate labels (#17905) @mroeschke
  • Fix groupby scans with int and NA data in mode.pandas_compatible (#17895) @mroeschke
  • Patch __init__ of cudf constructors to parse through cudf.pandas proxy objects (#17878) @galipremsagar
  • Fixed incorrect PTX parsing of ret instruction after branch label (#17859) @lamarrr
  • Relax inconsistent schema handling in dask_cudf.read_parquet (#17554) @rjzamora

📖 Documentation

  • Clarify that cudf.pandas should be enabled before importing pandas. (#18339) @bdice
  • [DOC] Add wordpiece tokenizer to cudf documentation (#18247) @davidwendt
  • Added pylibcudf.contiguous_split to API docs (#18194) @TomAugspurger
  • Fix build.sh docs for default behavior (#18180) @bdice
  • Update Dask-cuDF documentation to fix all warnings and errors (#18157) @TomAugspurger
  • [DOC] Document character normalizer (#18125) @Matt711

🚀 New Features

  • Add and revise experimental cudf-polars config options (#18284) @rjzamora
  • Support top-k and bottom_k expressions (#18222) @Matt711
  • Support cudf-polars is_leap_year (#18212) @brandon-b-miller
  • Support cudf-polars month_start/month_end (#18211) @brandon-b-miller
  • Support cudf-polars ordinal_day (#18152) @brandon-b-miller
  • Add pylibcudf.gpumemoryview support for len()/nbytes (#18133) @pentschev
  • Link to libzstd for ZSTD compression and decompression APIs (#18129) @shrshi
  • Added NDSH Q09 Benchmark for Transforms (#18127) @lamarrr
  • Make pylibcudf traits raise exceptions gracefully rather than terminating in C++ (#18117) @Matt711
  • Host decompression (#18114) @vuule
  • Add owning types to hold Arrow data (#18084) @vyasr
  • Bump polars version to <1.24 (#18076) @Matt711
  • Support sorted merges in cudf.polars (#18075) @Matt711
  • Add a slice expression to polars IR (#18050) @Matt711
  • Expose num_rows_per_source (IO metadata) to pylibcudf (#18049) @Matt711
  • Added Imbalanced Tree Benchmarks for Transforms (#18032) @lamarrr
  • Run the narwhals test suite with cudf.pandas (#18031) @Matt711
  • Add host_read_async interfaces to datasource (#18018) @vuule
  • Make most cudf-polars Node objects pickleable (#17998) @rjzamora
  • Add Column.serialize to cudf-polars (#17990) @rjzamora
  • Bump polars version to <1.23 (#17986) @Matt711
  • Implemented Decimal Transforms (#17968) @lamarrr
  • Introduce ZSTD host-side compression and decompression APIs (#17935) @shrshi
  • Add catboost integration tests (#17931) @Matt711
  • [FEA] Expose stripe_size_rows setting for ORCWriterOptions (#17927) @ustcfy
  • Test narwhals in CI (#17884) @bdice
  • Added Multi-input & Scalar Support for Transform UDFs (#17881) @lamarrr
  • Host Snappy compression (#17824) @vuule
  • Run spark-rapids-jni CI (#17781) @KyleFromNVIDIA
  • Add multi-partition Shuffle operation to cuDF Polars (#17744) @rjzamora
  • Added polynomials benchmark (#17695) @lamarrr
  • Add stream parameters in pylibcudf IO APIs (#17620) @Matt711
  • New nvtext::wordpiece_tokenizer APIs (#17600) @davidwendt
  • Add support for unary negation operator (#17560) @Matt711
  • Add multi-partition Join support to cuDF-Polars (#17518) @rjzamora
  • Add basic multi-partition GroupBy support to cuDF-Polars (#17503) @rjzamora
  • Support Distributed in cudf-polars tests and IR evaluation (#17364) @pentschev

🛠️ Improvements

  • Use pyarrow 15 in oldest dependency CI jobs (#18409) @bdice
  • Bump librdkafka to 2.8.0 (#18370) @raydouglass
  • fix(rattler): ignore libzlib run dependency to avoid pandoc collision (#18368) @gforsyth
  • Fix zstd build interface include definition (#18366) @trxcllnt
  • test: Install pytest-env and hypothesis in test_narwhals.sh (#18337) @MarcoGorelli
  • Remove unused group_range_rolling_window API (#18313) @wence-
  • Cache column view creation from arrow types (#18302) @vyasr
  • Split Narwhals cudf.pandas tests failures into to fix and to skip (#18267) @mroeschke
  • Support BinOp, min, and max Aggregations in cudf-polars parallel groupby (#18266) @TomAugspurger
  • Minor clean up and optimizations in the Parquet writer (#18258) @vuule
  • Fix cudf_kafka run export for cudatoolkit (#18245) @gforsyth
  • dask-polars: use splat everywhere. (#18243) @madsbk
  • Remove cudf.Scalar from binops (#18240) @mroeschke
  • Remove warning in the stream pool when asking for more streams than available (#18236) @vuule
  • Explain why we disable parallelism for profiler tests to avoid pytest-cov issue (#18234) @Matt711
  • Ignore cudatoolkit run exports by name, not package (#18230) @gforsyth
  • Revert "Bump nightly check limit" (#18227) @Matt711
  • Fix cudf.pandas to be able to work on a cpu-only machine (#18224) @galipremsagar
  • Add missing cudatoolkit run_export ignore to pylibcudf (#18223) @gforsyth
  • Remove cudf.Scalar from Column.setitem (#18221) @mroeschke
  • Remove unused rounduppow2 utility (#18218) @PointKernel
  • Add flake8-print/debugger Ruff rules (#18217) @mroeschke
  • Bump polars version to <1.25 (#18209) @Matt711
  • Export RAPIDS_ARTIFACTS_DIR. (#18208) @bdice
  • Drop more thrust functions with libcu++ ones (#18207) @miscco
  • Update Numpy <2.1 unpinning xfail condition (#18203) @mroeschke
  • Run conda import tests on Python packages (#18197) @bdice
  • fix(rattler): add cudatoolkit ignore run export to cudf (#18195) @gforsyth
  • Revert "Disable ARM CI in C++ and Python test CI jobs" (#18188) @Matt711
  • Define Column.where to be used across DataFrame/Series (#18186) @mroeschke
  • Remove cudf.Scalar in where (#18178) @mroeschke
  • Drop unnecessary fmt dep (#18177) @vyasr
  • Refactor join internals: separate hash_join declaration and cleanup (#18170) @PointKernel
  • Add Ruff rule to enforce cudf dtype utils over numpy/pandas dtype utils (#18169) @mroeschke
  • Combine multiple str.minhash() APIs into one call (#18168) @davidwendt
  • Move nanoarrow_utils.hpp from cpp/tests/interop to cpp/include/cudf_test (#18163) @davidwendt
  • Test cudf against the latest stable branch of Narwhals (#18162) @Matt711
  • fix libcudf pins cu11 (#18161) @gforsyth
  • Combine separate ConfigureNVBench calls to fix cpp conda builds (#18155) @gforsyth
  • Add telemetry to build workflows (#18154) @gforsyth
  • Prune more seldom used dtype utils (#18150) @mroeschke
  • Remove some unnecessary module imports (#18143) @mroeschke
  • Branch 25.04 merge branch 25.02 (#18142) @vyasr
  • Prune some seldom used dtype utils (#18141) @mroeschke
  • Use more, cheaper dtype checking utilities in cudf Python (#18139) @mroeschke
  • Support deserializing cudf-polars objects composed of RMM frames (#18138) @pentschev
  • Add ConfigOptions convenience class to cudf-polars (#18137) @rjzamora
  • Support new callback API for lazyframe.profile (#18132) @wence-
  • Optimized compilation of CUDFTESTUTIL's interface sources (#18131) @lamarrr
  • Unpin numpy<2.1 (#18128) @mroeschke
  • Use cpu16 for build CI jobs (#18124) @bdice
  • Remove now non-existent job (#18123) @vyasr
  • Minor typo fix in filling.pxd (#18120) @davidwendt
  • Replace more deprecated CUB functors (#18119) @miscco
  • Simplify DecimalDtype and DecimalColumn operations (#18111) @mroeschke
  • Add interop support from arrow StringView to libcudf strings column (#18107) @davidwendt
  • Expose the Number of Filtered Parquet Rowgroups (IO Metadata) to pylibcudf (#18106) @JigaoLuo
  • Add a list of expected failures to narwhals tests (#18097) @Matt711
  • Remove unused var (#18096) @vyasr
  • Run narwhals tests nightly. (#18093) @bdice
  • Use conda-build instead of conda-mambabuild (#18092) @bdice
  • Remove static configure step (#18091) @vyasr
  • Remove FindCUDAToolkit.cmake from .pre-commit-config.yaml (#18087) @KyleFromNVIDIA
  • Align StringColumn constructor with ColumnBase base class (#18086) @mroeschke
  • Remove FindCUDAToolkit backport (#18081) @KyleFromNVIDIA
  • Support melt(ignore_index=False) (#18080) @mroeschke
  • Update numba dep and upper-bound numpy (#18078) @vyasr
  • Add as_proxy_object API to cudf.pandas (#18072) @galipremsagar
  • Enforce deprecation of dtype parameter in sum/product (#18070) @mroeschke
  • send sccache logs to telemetry (#18069) @msarahan
  • Short circuit Index.equal if compared Index isn't same type (#18067) @mroeschke
  • Make Column.view/can_cast_safely accept a dtype object (#18066) @mroeschke
  • Optimization improvement for substr in cudf::string_view (#18062) @davidwendt
  • Forward-merge branch-25.02 to branch-25.04 (#18061) @bdice
  • Port all conda recipes to rattler-build (#18054) @gforsyth
  • Minor improvements in arrow interop (#18053) @wence-
  • Pass more dtype objects to astype calls (#18044) @mroeschke
  • Forward merge branch-25.02 to branch-25.04 (#18041) @Matt711
  • Replace deprecated CCCL features (#18036) @miscco
  • Separate stats filtering helpers to reuse in page pruning (#18034) @mhaseeb123
  • Update spark-rapids-jni CI image version to cuda12.8.0 (#18024) @pxLi
  • Add pylibcudf.Scalar.from_numpy for bool/int/float/str types (#18020) @mroeschke
  • Support IntervalDtype(subtype=None) (#18017) @mroeschke
  • Enable pytest-xdist runs for py-polars tests (#18016) @galipremsagar
  • consolidate more conda solves in CI (#18014) @jameslamb
  • Replace cub::Int2Type with cuda::std::integral_constant (#18013) @miscco
  • Remove deprecated single component datetime extract APIs (#18010) @Matt711
  • Pass dtype objects to Column.astype (#18008) @mroeschke
  • Require CMake 3.30.4 (#18007) @robertmaynard
  • Refactor math_ops.cu dispatcher logic (#18006) @davidwendt
  • Move cudf::lists::detail::make_empty_lists_column to public API (#17996) @davidwendt
  • Create Conda CI test env in one step (#17995) @KyleFromNVIDIA
  • Add seed parameter to cudf hash_character_ngrams (#17994) @davidwendt
  • Remove deprecated rolling window functionality (#17993) @wence-
  • Continue on failures in cudf.pandas integration tests CI job (#17987) @Matt711
  • Avoid cudf.dtype calls in build_column/column_empty/.where (#17979) @mroeschke
  • Ensure dtype objects are passed within Column.astype (#17978) @mroeschke
  • Use Conda XGBoost (#17959) @jakirkham
  • Read the footers in parallel when reading multiple Parquet files (#17957) @vuule
  • Refactor predicate pushdown to reuse row group pruning in experimental PQ reader (#17946) @mhaseeb123
  • Add new nvtext tokenized minhash API (#17944) @davidwendt
  • Use shared-workflows branch-25.04 (#17943) @bdice
  • Get rid of the deprecated thrust::identity (#17942) @PointKernel
  • Remove deprecated nvtext::minhash_permuted APIs (#17939) @davidwendt
  • Enable third party library integration tests in CI with cudf.pandas (#17936) @galipremsagar
  • Add build_type input field for test.yaml (#17925) @gforsyth
  • Remove cudf.Scalar from shift/fillna (#17922) @mroeschke
  • Enabling cross join in cudf python (#17921) @galipremsagar
  • Use rapids-pip-retry in CI jobs that might need retries (#17920) @gforsyth
  • More avoid cudf.dtype internally in favor of pre-defined, supported types (#17918) @mroeschke
  • Initialize inout parameter (#17911) @miscco
  • Remove dataframe protocol (#17909) @vyasr
  • Rename PascalCase functions and types to snake_case to improve consistency (#17908) @vuule
  • Use new rapids-logger library (#17899) @vyasr
  • Add pylibcudf.Scalar.from_py for construction from Python strings, bool, int, float (#17898) @mroeschke
  • Remove cudf.Scalar from factorize (#17897) @mroeschke
  • disallow fallback to Make in Python builds (#17894) @jameslamb
  • Remove orc::gpu namespace (#17891) @vuule
  • Only run Auto Assign PR workflow if PR is not merged (#17888) @mroeschke
  • Update pre-commit-hooks to version 0.6.0 (#17887) @KyleFromNVIDIA
  • Forward-merge branch-25.02 to branch-25.04 (#17885) @bdice
  • Add script to run pylibcudf tests (#17882) @bdice
  • Migrate to NVKS for amd64 CI runners (#17877) @bdice
  • Fix merge conflict for branch-25.02 into branch-25.04 (#17874) @davidwendt
  • Remove decimal32/64 to decimal128 conversion in Parquet writer (#17869) @mhaseeb123
  • Expose JSON reader options to builder in pylibcudf (#17866) @shrshi
  • Remove cudf.Scalar from .dt timedelta properties (#17863) @mroeschke
  • Added support for custom types in PTX parser (#17861) @lamarrr
  • Remove cudf.Scalar from date_range/to_datetime (#17860) @mroeschke
  • Avoid cudf.dtype internally in favor of pre-defined, supported types (#17839) @mroeschke
  • Allow cudf::type_to_id<T const>() (#17831) @esoha-nvidia
  • Fixing auto-merge branch-25.02 into branch-25.04 (#17828) @davidwendt
  • Add new nvtext::normalize_characters API (#17818) @davidwendt
  • Include more information in error messages in the nvcomp adapter (#17814) @vuule
  • Extend and simplify API for calculation of range-based rolling window offsets (#17807) @wence-
  • More minor fixes for CCCL (#17793) @miscco
  • Use KvikIO to enable file's fast host read and host write (#17764) @kingcrimsontianyu
  • Remove cudf._lib.column in favor of pylibcudf. (#17760) @mroeschke
  • Replaced std::string with std::string_view and removed excessive copies in cudf::io (#17734) @lamarrr
  • Use xdist worksteal on the cudf.pandas test suite (#16930) @Matt711

- C++
Published by AyodeAwe 11 months ago

https://github.com/rapidsai/cudf - [NIGHTLY] v25.06.00

🔗 Links

🚨 Breaking Changes

  • Promote Parquet type enums to enum classes (#18441) @mhaseeb123
  • Move parquet schema types and structs to public headers (#18424) @mhaseeb123
  • Start removal of vector factories with _sync suffix by deprecating them and adding versions without the suffix (#18414) @vuule
  • Skip decoding of pages marked as pruned in PQ reader (#18347) @mhaseeb123
  • Deprecate nvtext subword tokenizer (#18334) @davidwendt
  • Add standard data ingestion pipelines to pylibcudf for ndarrays (#18311) @Matt711
  • Add Keep Option Parameter to Distinct (#18237) @warrickhe

🐛 Bug Fixes

  • Fix cpp examples cmake to use the rapids_config.cmake (#18501) @davidwendt
  • Rename rapidsmp to rapidsmpf (#18493) @rjzamora
  • Fix compilation with the C++20 standard (#18486) @vuule
  • Fix an error when reading some compressed Parquet V2 files (#18478) @vuule
  • Ensure DataFrame column label operations reset label_dtype (#18452) @mroeschke
  • Fix a segfault when reading a Parquet file with unsupported compression type (#18451) @vuule
  • Fix logger macros (#18444) @vyasr
  • Use delete not free to release data allocated with new (#18412) @wence-
  • Fix synchronization issues in host compression and decompression (#18395) @vuule
  • Update Dask array-conversion handling (#18382) @rjzamora
  • Fixed indexing on empty DataFrame with no columns (#18381) @TomAugspurger
  • Deterministic hashing for DataFrameScan nodes in cudf-polars multi-partition executor (#18351) @TomAugspurger
  • Fix index of right table in unary operators in AST, in Joins (#18333) @karthikeyann
  • Add offsetalator to contiguous-split (#18312) @davidwendt
  • Support large strings in nvtext vocabulary-tokenizer (#18283) @davidwendt

📖 Documentation

  • [DOC] Improve clarity in parquet APIs set_row_groups and set_columns parquet (#18466) @Matt711
  • Add a usage page to cudf-polars documentation (#18460) @Matt711
  • [DOC] Fix typo in CONTRIBUTING.md on build type tests (#18456) @JigaoLuo
  • Add restart kernel note in cudf pandas docs (#18374) @ncclementi

🚀 New Features

  • Support reading from device buffers in the pylibcudf IO APIs (#18496) @Matt711
  • Move parquet schema types and structs to public headers (#18424) @mhaseeb123
  • Add optional dtype argument to Scalar.from_any (#18415) @Matt711
  • Expose cudf::chunked_pack in pylibcudf (#18411) @wence-
  • Add support for long string columns in cudf::contiguous_split (#18393) @nvdbaranec
  • Automatically dispatch between host and device decompression/compression based on the number of buffers (#18363) @vuule
  • Skip decoding of pages marked as pruned in PQ reader (#18347) @mhaseeb123
  • Support constructing pylibcudf Columns and Tables from views into arbitrary objects (#18314) @vyasr
  • Add standard data ingestion pipelines to pylibcudf for ndarrays (#18311) @Matt711
  • Support cudf-polars isoyear and week (isoweek) (#18265) @brandon-b-miller
  • Add Keep Option Parameter to Distinct (#18237) @warrickhe
  • Add rapidsmp shuffle support to cudf-polars (#18231) @rjzamora
  • Support cudf-polars strftime (#18181) @brandon-b-miller
  • Support include_file_paths in cudf polars (#18057) @Matt711

🛠️ Improvements

  • Optimize pandas metadata generation to reduce memory pressure (#18505) @galipremsagar
  • Add pylibcudf.Column.from_rmm_buffer (#18502) @mroeschke
  • Replace thrust functors with libcu++ ones (#18500) @miscco
  • Rename cudf-polars executors (#18499) @rjzamora
  • Remove casting functions in pylibcudf utils (#18497) @Matt711
  • Increase wheel size limit. (#18487) @bdice
  • Split join header (#18484) @shrshi
  • Fix unspecified behavior involving move semantics and order of evaluation (#18481) @kingcrimsontianyu
  • Rerun flaky pytests in CI (#18476) @galipremsagar
  • Vendor RAPIDS.cmake (#18473) @bdice
  • Add ARM conda environments. (#18470) @bdice
  • Bump polars version to <1.28 (#18469) @Matt711
  • Promote Parquet type enums to enum classes (#18441) @mhaseeb123
  • Update compression formats supported in JSON reader (#18438) @shrshi
  • Disabled Jitify Minification (#18436) @lamarrr
  • Replace direct use of nvCOMP and of its adapter with the higher-level decompression API (#18434) @vuule
  • Test against stable tags for narwhals (#18431) @Matt711
  • Refcount-based dropping of cached evaluations in cudf-polars executor (#18430) @wence-
  • Replace Thrust iterator facilities with libcu++ ones (#18427) @miscco
  • Remove numpy requirement when converting 2d cuda array interface objects to pylibcudf Columns (#18426) @Matt711
  • Switch the ptr type in gpumemoryview from Py_ssize_t to uintptr_t (#18419) @Matt711
  • Add strings::extract_single API (#18417) @davidwendt
  • Start removal of vector factories with _sync suffix by deprecating them and adding versions without the suffix (#18414) @vuule
  • Allow polars arrow conversion to produce string_view (#18413) @wence-
  • Add rank and label_bin methods to ColumnBase (#18407) @mroeschke
  • Automatic single-partition fallback in cudf-polars (#18405) @rjzamora
  • Remove _sync suffix from hostdevice types (#18404) @vuule
  • Use owning Arrow types in C++ to expose data to Python (#18402) @vyasr
  • add static push and pop methods to NvtxRange (#18401) @zpuller
  • Deprecate cudf.Scalar (#18394) @mroeschke
  • Bump polars version to <1.27 (#18387) @Matt711
  • Branch 25.06 merge 25.04 (#18380) @Matt711
  • Silence warning by setting BUILD_SHARED_LIBS (#18371) @vyasr
  • Pass stream through when taking ownership from libcudf (#18367) @wence-
  • Avoid patching sort algorithms from CCCL (#18364) @miscco
  • Deprecate old nvtext::normalize_characters (#18360) @davidwendt
  • refactor(rattler): enable strict channel priority for builds (#18358) @gforsyth
  • Optimize sequences by introducing make_offsets_child_column (#18357) @ustcfy
  • Decompress all data in a single decompress_page_data when reading Parquet input in a single chunk (#18352) @vuule
  • Performance improvement for tolower/toupper for multi-byte UTF-8 characters (#18345) @davidwendt
  • Branch 25.06 merge branch 25.04 (#18344) @vyasr
  • Use dask-cuda for cudf-polars experimental testing (#18343) @rjzamora
  • Deprecate nvtext subword tokenizer (#18334) @davidwendt
  • Remove cudf.Scalar in as_column (#18331) @mroeschke
  • Allow cudf.DataFrame.from_pylibcudf to accept a pylibcudf.io.TableWithMetadata (#18319) @mroeschke
  • Avoid stateful construction in DataFrame.__init__ (#18306) @mroeschke
  • Improve the groupby performance for extremely low cardinality (#18290) @PointKernel
  • Require type annotations in cudf.polars (#18285) @TomAugspurger
  • Removing unnecessary StreamSynchronization in reading (#18279) @JigaoLuo
  • Use the mapped buffer for all read operations in the memory-mapped source; switch default source to the kvikIO one (#18204) @vuule
  • Improve test coverage in the catboost integration tests (#18126) @Matt711
  • Create file sources in parallel (#18094) @vuule

- C++
Published by rapids-bot[bot] 11 months ago

https://github.com/rapidsai/cudf - v25.02.02

🚨 Breaking Changes

  • Expose stream-ordering in scalar and avro APIs (#17766) @shrshi
  • Add seed parameter to hash_character_ngrams (#17643) @davidwendt
  • Performance improvements and simplifications for fixed size row-based rolling windows (#17623) @wence-
  • Refactor distinct hash join to handle multiple probes with the same build table (#17609) @PointKernel
  • Deprecate cudf::grouped_time_range_rolling_window (#17589) @wence-
  • Remove "legacy" Dask DataFrame support from Dask cuDF (#17558) @rjzamora
  • Return empty result for segmented_reduce if input and offsets are both empty (#17437) @davidwendt
  • Rework minhash APIs for deprecation cycle (#17421) @davidwendt
  • Change indices for dictionary column to signed integer type (#17390) @davidwendt

🐛 Bug Fixes

  • Use protocol for dlpack instead of deprecated function (#18134) @vyasr
  • Skip the failing connectorx polars tests (#18037) @Matt711
  • Fix 'Unexpected short subpass' exception in parquet chunked reader. (#18019) @nvdbaranec
  • Fix race check failures in shared memory groupby (#17985) @PointKernel
  • Pin ibis version in the cudf.pandas integration tests <10.0.0 (#17975) @Matt711
  • Fix the index type in the indexing operator of the span types (#17971) @vuule
  • Add missing pin (#17915) @vyasr
  • Fix third-party cudf.pandas tests (#17900) @galipremsagar
  • Fix numpy data access by making attribute private (#17890) @galipremsagar
  • Remove extra local var declaration from cudf.pandas 3rd-party integration shell script (#17886) @Matt711
  • Move isinstance_cudf_pandas to fast_slow_proxy (#17875) @galipremsagar
  • Make _Series_dtype method a property (#17854) @Matt711
  • Fix the bug in determining the heuristics for shared memory groupby (#17851) @PointKernel
  • Fix possible OOB mem access in Parquet decoder (#17841) @mhaseeb123
  • Require batches to be non-empty in multi-batch JSON reader (#17837) @shrshi
  • Fix rolling(min_periods=) with int and null data with mode.pandas_compat (#17822) @mroeschke
  • Resolve race-condition in disable_module_accelerator (#17811) @galipremsagar
  • Make Series(dtype=object) raise in mode.pandas_compat with non string data (#17804) @mroeschke
  • Disable intended disabled ORC tests (#17790) @davidwendt
  • Fix empty DataFrame construction not returning RangeIndex columns (#17784) @mroeschke
  • Fix various .str methods for pandas compatibility (#17782) @mroeschke
  • Fix count API issue about ignoring nan values (#17779) @galipremsagar
  • Add numba pinning to cudf repo (#17777) @galipremsagar
  • Allow .sort_values(na_position=) to include NaNs in mode.pandas_compatible (#17776) @mroeschke
  • allow deselecting nvcomp wheels (#17774) @jameslamb
  • Use the aligned_resource_adaptor to allocate bloom filter device buffers (#17758) @mhaseeb123
  • Avoid instantiating bloom filter query function for nested and bool types (#17753) @mhaseeb123
  • Fix DataFrame.merge(Series, how="left"/"right") on column and index not resulting in a RangeIndex (#17739) @mroeschke
  • [BUG] xfail Polars excel test (#17731) @Matt711
  • Require to implement AutoCloseable for the classes derived from HostUDFWrapper (#17727) @ttnghia
  • Remove jlowe as a java committer since he retired (#17725) @tgravescs
  • Prevent use of invalid grid sizes in ORC reader and writer (#17709) @vuule
  • Enforce schema for partial tables in multi-source multi-batch JSON reader (#17708) @shrshi
  • Compute and use the initial string offset when building nested large string cols with chunked parquet reader (#17702) @mhaseeb123
  • Fix writing of compressed ORC files with large stripe footers (#17700) @vuule
  • Fix cudf.polars sum of empty not equalling zero (#17685) @mroeschke
  • Fix formatting in logging (#17680) @vuule
  • convert all nulls to nans in a specific scenario (#17677) @galipremsagar
  • Define cudf repr methods on the Column (#17675) @mroeschke
  • Fix groupby.len with null values in cudf.polars (#17671) @mroeschke
  • Fix: DataFrameGroupBy.get_group was raising with length>1 tuples (#17653) @MarcoGorelli
  • Fix possible int overflow in compute_mixed_join_output_size (#17633) @davidwendt
  • Fix a minor potential i32 overflow in thrust::transform_exclusive_scan in PQ reader preprocessing (#17617) @mhaseeb123
  • Fix failing xgboost test in the cudf.pandas third-party integration tests (#17616) @Matt711
  • Fix dask_cudf.read_csv (#17612) @rjzamora
  • Fix memcheck error in ReplaceTest.NormalizeNansAndZerosMutable gtest (#17610) @davidwendt
  • Correctly accept a pandas.CategoricalDtype(pandas.IntervalDtype(...), ...) type (#17604) @mroeschke
  • Add ability to modify and propagate names of columns object (#17597) @galipremsagar
  • Ignore NaN correctly in .quantile (#17593) @mroeschke
  • Fix groupby argmin/max gather of sorted-order indices (#17591) @davidwendt
  • Fix ctest fail running libcudf tests in a Debug build (#17576) @davidwendt
  • Specify a version for rapids_logger dependency (#17573) @jlowe
  • Fix the ORC decoding bug for the timestamp data (#17570) @kingcrimsontianyu
  • [JNI] remove rmm argument to set rw access for fabric handles (#17553) @abellina
  • Document undefined behavior in div_rounding_up_safe (#17542) @davidwendt
  • Fix nvcc-imposed UB in constexpr functions (#17534) @vuule
  • Add anonymous namespace to libcudf test source (#17529) @davidwendt
  • Propagate failures in pandas integration tests and Skip failing tests (#17521) @Matt711
  • Fix libcudf compile error when logging is disabled (#17512) @davidwendt
  • Fix Dask-cuDF clip APIs (#17509) @rjzamora
  • Fix pylibcudf to_arrow with multiple nested data types (#17504) @mroeschke
  • Fix groupby(as_index=False).size not resetting index (#17499) @mroeschke
  • Revert "Temporarily skip tests due to dask/distributed#8953" (#17492) @Matt711
  • Workaround for a misaligned access in read_csv on some CUDA versions (#17477) @vuule
  • Fix some possible thread-id overflow calculations (#17473) @davidwendt
  • Temporarily skip tests due to dask/distributed#8953 (#17472) @wence-
  • Detect mismatches in begin and end tokens returned by JSON tokenizer FST (#17471) @shrshi
  • Support dask>=2024.11.2 in Dask cuDF (#17439) @rjzamora
  • Fix write_json failure for zero columns in table/struct (#17414) @karthikeyann
  • Fix Debug-mode failing Arrow test (#17405) @zeroshade
  • Fix all null list column with missing child column in JSON reader (#17348) @karthikeyann

📖 Documentation

  • Fix forward merge 24.12->25.02 (#18002) @raydouglass
  • Fix incorrect example in pylibcudf docs (#17912) @Matt711
  • Explicitly call out that the GPU open beta runs on a single GPU (#17872) @taureandyernv
  • Update cudf.pandas colab link in docs (#17846) @taureandyernv
  • [DOC] Make pylibcudf docs more visible (#17803) @Matt711
  • Cross-link cudf.pandas profiler documentation. (#17668) @bdice
  • Document interpreter install command for cudf.pandas (#17358) @bdice
  • add comment to Series.tolist method (#17350) @tequilayu

🚀 New Features

  • Bump polars version to <1.22 (#17771) @Matt711
  • Make more constexpr available on device for cuIO (#17746) @PointKernel
  • Add public interop functions between pylibcudf and cudf classic (#17730) @Matt711
  • Support dask_expr migration into dask.dataframe (#17704) @rjzamora
  • Make tests build without relaxed constexpr (#17691) @PointKernel
  • Set default logger level to warn (#17684) @vyasr
  • Support multithreaded reading of compressed buffers in JSON reader (#17670) @shrshi
  • Control pinned memory use with environment variables (#17657) @vuule
  • Host compression (#17656) @vuule
  • Enable text build without relying on relaxed constexpr (#17647) @PointKernel
  • Implement HOST_UDF aggregation for reduction and segmented reduction (#17645) @ttnghia
  • Add JSON reader options structs to pylibcudf (#17614) @Matt711
  • Refactor distinct hash join to handle multiple probes with the same build table (#17609) @PointKernel
  • Add JSON Writer options classes to pylibcudf (#17606) @Matt711
  • Add ORC reader options structs to pylibcudf (#17601) @Matt711
  • Add Avro Reader options classes to pylibcudf (#17599) @Matt711
  • Enable binaryop build without relying on relaxed constexpr (#17598) @PointKernel
  • Measure the number of Parquet row groups filtered by predicate pushdown (#17594) @mhaseeb123
  • Implement HOST_UDF aggregation for groupby (#17592) @ttnghia
  • Plumb pylibcudf.io.parquet options classes through cudf python (#17506) @Matt711
  • Add partition-wise Select support to cuDF-Polars (#17495) @rjzamora
  • Add multi-partition Scan support to cuDF-Polars (#17494) @rjzamora
  • Migrate cudf::io::merge_row_group_metadata to pylibcudf (#17491) @Matt711
  • Add Parquet Reader options classes to pylibcudf (#17464) @Matt711
  • Add multi-partition DataFrameScan support to cuDF-Polars (#17441) @rjzamora
  • Return empty result for segmented_reduce if input and offsets are both empty (#17437) @davidwendt
  • Abstract polars function expression nodes to ensure they are serializable (#17418) @pentschev
  • Add CSV Reader options classes to pylibcudf (#17412) @Matt711
  • Add support for pylibcudf.DataType serialization (#17352) @pentschev
  • Enable rounding for Decimal32 and Decimal64 in cuDF (#17332) @a-hirota
  • Remove upper bounds on cuda-python to allow 12.6.2 and 11.8.5 (#17326) @bdice
  • Expose stream-ordering to groupby APIs (#17324) @shrshi
  • Migrate ORC Writer to pylibcudf (#17310) @Matt711
  • Support reading bloom filters from Parquet files and filter row groups using them (#17289) @mhaseeb123

🛠️ Improvements

  • Update to nvcomp 4.2.0.11 (#18042) @bdice
  • Remove pandas backend from cudf.pandas - ibis integration tests (#17945) @Matt711
  • Revert CUDA 12.8 shared workflow branch changes (#17879) @vyasr
  • Remove predicate param from DataFrameScan IR (#17852) @Matt711
  • Remove cudf.Scalar from scatter APIs (#17847) @mroeschke
  • Remove cudf.Scalar from interval_range (#17844) @mroeschke
  • Add verify-codeowners hook (#17840) @KyleFromNVIDIA
  • Build and test with CUDA 12.8.0 (#17834) @bdice
  • Increase timeout for recently added test (#17829) @galipremsagar
  • Apply ruff everywhere (notebooks and scripts) (#17820) @bdice
  • Fix pre-commit.ci failures (#17819) @bdice
  • Remove incorrect calls to set architectures (#17813) @vyasr
  • Fix typo in exception raised when attempting to convert a string column to cupy (#17800) @dagardner-nv
  • Add support for pyarrow-19 (#17794) @galipremsagar
  • increase parallelism in nightly builds (#17792) @jameslamb
  • Reduce libcudf memcheck tests output (#17791) @davidwendt
  • Make cudf build with latest CCCL (#17788) @miscco
  • Introduce some more rolling window benchmarks (#17787) @wence-
  • Add shellcheck to pre-commit and fix warnings (#17778) @gforsyth
  • Improve parquet reader very-long string performance (#17773) @pmattione-nvidia
  • Update how to manage host UDF instance (#17770) @res-life
  • Add getInts api for HostMemoryBuffer and UnsafeMemoryAccessor (#17767) @liurenjie1024
  • Expose stream-ordering in scalar and avro APIs (#17766) @shrshi
  • Standardize methods used from cudf.core._internals (#17765) @mroeschke
  • Implement string join in cudf-polars (#17755) @wence-
  • Deprecate dataframe protocol (#17736) @vyasr
  • Add parquet reader long row test (#17735) @pmattione-nvidia
  • Update kvikio call due to upstream changes (#17733) @kingcrimsontianyu
  • Delay setting MultiIndex.level/codes until needed (#17728) @mroeschke
  • Bounding pool size in multi-batch JSON reader (#17724) @shrshi
  • Use GCC 13 in CUDA 12 conda builds. (#17721) @bdice
  • Update minimal sphinx theme version so that we can use parallel doc builds (#17719) @vyasr
  • Add more aggregation methods in pylibcudf (#17717) @mroeschke
  • Make cudf._lib.string_udf work with pylibcudf Columns instead of cudf._lib Columns (#17715) @mroeschke
  • Add special orc test data: timestamp interspersed with null values (#17713) @kingcrimsontianyu
  • Add pylibcudf.null_mask.null_count (#17711) @mroeschke
  • Ensure pyarrow.Scalar to pylibcudf.Scalar is cached (#17707) @mroeschke
  • Adapt cudf numba config for numba 0.61 removal (#17705) @mroeschke
  • Remove cudf._lib.scalar in favor of pylibcudf (#17701) @mroeschke
  • Fix parquet reader list bug (#17699) @pmattione-nvidia
  • Migrated Dynamic AST Expression Trees in Benchmarks and Tests to use AST Tree (#17697) @lamarrr
  • Skip polars test that can generate timezones that chrono_tz doesn't know (#17694) @wence-
  • Use 64-bit offsets only if the current strings column output chunk size exceeds threshold (#17693) @mhaseeb123
  • Use latest ci-conda images (#17690) @bdice
  • Add multi-source reading to JSON reader benchmarks (#17688) @shrshi
  • Convert cudf.Scalar usage to pylibcudf and pyarrow usage (#17686) @mroeschke
  • remove find_package(Python) in libcudf build (#17683) @jameslamb
  • Fix build metrics report format with long placeholder filenames (#17679) @davidwendt
  • Use rapids-cmake for the logger (#17674) @vyasr
  • Java Parquet reads via multiple host buffers (#17673) @jlowe
  • Remove cudf._libs.types.pyx (#17665) @mroeschke
  • Add support for Groupby.cumprod (#17661) @galipremsagar
  • Implement .dt.total_seconds (#17659) @galipremsagar
  • Avoid shallow copies in groupby methods (#17646) @mroeschke
  • Avoid double MultiIndex factorization in groupby index result (#17644) @mroeschke
  • Add seed parameter to hash_character_ngrams (#17643) @davidwendt
  • Fix possible overflow in WriteCoalescingCallbackWrapper::TearDown (#17642) @davidwendt
  • Remove pragma GCC diagnostic from source files (#17637) @davidwendt
  • Move unnecessary utilities from cudf._lib.scalar (#17636) @mroeschke
  • Support compression= in DataFrame.to_json (#17634) @mroeschke
  • Bump Polars version to <1.18 (#17632) @Matt711
  • Add public APIs to Access Underlying cudf and pandas Objects from cudf.pandas Proxy Objects (#17629) @galipremsagar
  • Use Numba Config to turn on Pynvjitlink Features (#17628) @isVoid
  • Use PyNVML 12 (#17627) @jakirkham
  • Remove cudf._lib.utils in favor of python APIs (#17625) @mroeschke
  • Performance improvements and simplifications for fixed size row-based rolling windows (#17623) @wence-
  • Fix return types for MurmurHash3_x86_32 template specializations (#17622) @davidwendt
  • Clean up namespaces and improve compression-related headers (#17621) @vuule
  • Use more pylibcudf.types instead of cudf._lib.types (#17619) @mroeschke
  • Remove patch that is only needed for clang-tidy to run on test files (#17618) @vyasr
  • update telemetry actions to fluent-bit friendly style (#17615) @msarahan
  • Introduce some simple benchmarks for rolling window aggregations (#17613) @wence-
  • Bump the oldest pyarrow version to 14.0.2 in test matrix (#17611) @galipremsagar
  • Use [[nodiscard]] attribute before __device__ (#17608) @vuule
  • Use host_vector in flatten_single_pass_aggs (#17605) @vuule
  • Stop memory_resource.hpp from including itself (#17603) @vyasr
  • Replace the outdated cuco window concept with buckets (#17602) @PointKernel
  • Check if nightlies have succeeded recently enough (#17596) @vyasr
  • Deprecate cudf::grouped_time_range_rolling_window (#17589) @wence-
  • A couple of fixes in rapids-logger usage (#17588) @vyasr
  • Simplify expression transformer in Parquet predicate pushdown with ast::tree (#17587) @mhaseeb123
  • Remove unused functionality in cudf._lib.utils.pyx (#17586) @mroeschke
  • Use cuda-python cuda.bindings import names. (#17585) @bdice
  • Use no-sync copy for fixed-width types in cudf::concatenate (#17584) @davidwendt
  • Remove cudf._lib.groupby in favor of inlining pylibcudf (#17582) @mroeschke
  • Remove unused code of json schema in JSON reader (#17581) @karthikeyann
  • Expose Scalar's constructor and Scalar#getScalarHandle() to public (#17580) @ttnghia
  • Allow large strings in nvtext benchmarks (#17579) @davidwendt
  • Remove cudf._lib.reduce in favor of inlining pylibcudf (#17574) @mroeschke
  • Use batched memcpy when writing ORC statistics (#17572) @vuule
  • Allow large strings in nvbench strings benchmarks (#17571) @davidwendt
  • Update version references in workflow (#17568) @AyodeAwe
  • Enable all json reader options in pylibcudf read_json (#17563) @karthikeyann
  • Remove cudf._lib.parquet in favor of inlining pylibcudf (#17562) @mroeschke
  • Fix CMake format in cudf/_lib/CMakeLists.txt (#17559) @mroeschke
  • Remove "legacy" Dask DataFrame support from Dask cuDF (#17558) @rjzamora
  • Replace direct cudaMemcpyAsync calls with utility functions (within /include) (#17557) @vuule
  • Remove cudf._lib.interop in favor of inlining pylibcudf (#17555) @mroeschke
  • gate telemetry dispatch calls on TELEMETRY_ENABLED env var (#17551) @msarahan
  • Replace direct cudaMemcpyAsync calls with utility functions (within /src) (#17550) @vuule
  • Remove unused BufferArrayFromVector (#17549) @Matt711
  • Move cudf._lib.copying to cudf.core._internals (#17548) @mroeschke
  • Update cuda-python lower bounds to 12.6.2 / 11.8.5 (#17547) @bdice
  • Fix typos, rename types, and add null_probability benchmark axis for distinct (#17546) @PointKernel
  • Mark more constexpr functions as device-available (#17545) @vyasr
  • Use cooperative-groups instead of cub warp-reduce for strings contains (#17540) @davidwendt
  • Remove cudf._lib.nvtext in favor of inlining pylibcudf (#17535) @mroeschke
  • Add XXHash_32 hasher (#17533) @PointKernel
  • Remove unused masked keyword in column_empty (#17530) @mroeschke
  • Remove Thrust patch in favor of CMake definition for Thrust 32-bit offset types. (#17527) @bdice
  • [JNI] Enables fabric handles for CUDA async memory pools (#17526) @abellina
  • Force Thrust to use 32-bit offset type. (#17523) @bdice
  • Replace cudf::detail::copy_if logic with thrust::copy_if and gather (#17520) @davidwendt
  • Replaces uses of cudf._lib.Column.from_unique_ptr with pylibcudf.Column.from_libcudf (#17517) @Matt711
  • Move cudf._lib.aggregation to cudf.core._internals (#17516) @mroeschke
  • Migrate copy_column and Column.from_scalar to pylibcudf (#17513) @Matt711
  • Remove cudf._lib.transform in favor of inlining pylibcudf (#17505) @mroeschke
  • Remove cudf._lib.string.convert/split in favor of inlining pylibcudf (#17496) @mroeschke
  • Move cudf._lib.sort to cudf.core._internals (#17488) @mroeschke
  • Remove cudf._lib.csv in favor of inlining pylibcudf (#17485) @mroeschke
  • Update PyTorch to >=2.4.0 to get fix for CUDA array interface bug, and drop CUDA 11 PyTorch tests. (#17475) @bdice
  • Remove cudf._lib.binops in favor of inlining pylibcudf (#17468) @mroeschke
  • Remove cudf._lib.orc in favor of inlining pylibcudf (#17466) @mroeschke
  • skip most CI on devcontainer-only changes (#17465) @jameslamb
  • Set build type for all examples (#17463) @vyasr
  • Update the hook versions in pre-commit (#17462) @wence-
  • Remove cudf._lib.string_casting in favor of inlining pylibcudf (#17460) @mroeschke
  • Remove cudf._lib.filling in favor of inlining pylibcudf (#17459) @mroeschke
  • Update MurmurHash3_x64_128 to use the cuco equivalent implementation (#17457) @PointKernel
  • Move cudf._lib.stream_compaction to cudf.core._internals (#17456) @mroeschke
  • Clean up xxhash_64 implementations (#17455) @PointKernel
  • Update Hadoop dependency in Java pom (#17454) @jlowe
  • Adapt to rmm logger changes (#17451) @vyasr
  • Require approval to run CI on draft PRs (#17450) @bdice
  • Expose stream-ordering in nvtext API (#17446) @shrshi
  • Use exec_policy_nosync in write_json (#17445) @karthikeyann
  • Remove cudf._lib.json in favor of inlining pylibcudf (#17443) @mroeschke
  • Remove cudf._lib.null_mask in favor of inlining pylibcudf (#17440) @mroeschke
  • Expose stream-ordering in replace API (#17436) @shrshi
  • Expose stream-ordering in copying APIs (#17435) @shrshi
  • Expose stream-ordering in column view APIs (#17434) @shrshi
  • Apply clang-tidy autofixes from new rules (#17431) @vyasr
  • Remove cudf._lib.round in favor of inlining pylibcudf (#17430) @mroeschke
  • Update MurmurHash3_x86_32 to use the cuco equivalent implementation (#17429) @PointKernel
  • Remove cudf._lib.replace in favor of inlining pylibcudf (#17428) @mroeschke
  • Remove nvtx/ranges.hpp include from cuda.cuh (#17427) @davidwendt
  • Remove the unused detail int_fastdiv.h header (#17426) @PointKernel
  • Remove cudf._lib.lists in favor of inlining pylibcudf (#17425) @mroeschke
  • Remove cudf._lib.quantile (#17424) @mroeschke
  • Remove cudf._lib.rolling in favor of inlining pylibcudf (#17423) @mroeschke
  • Avoid converting Decimal32/Decimal64 in to_arrow and from_arrow APIs (#17422) @zeroshade
  • Rework minhash APIs for deprecation cycle (#17421) @davidwendt
  • Use thread_index_type in binary-ops jit kernel.cu (#17420) @davidwendt
  • Change binops for-each kernel to thrust::for_each_n (#17419) @davidwendt
  • Move cudf._lib.search to cudf.core._internals (#17411) @mroeschke
  • Use grid_1d utilities in copy_range.cuh (#17409) @davidwendt
  • Remove cudf._lib.text in favor of inlining pylibcudf (#17408) @mroeschke
  • Run clang-tidy checks in PR CI (#17407) @bdice
  • Update strings/text source to use grid_1d for thread/block/stride calculations (#17404) @davidwendt
  • Expose stream-ordering to strings attribute APIs (#17398) @shrshi
  • Expose stream-ordering to interop APIs (#17397) @shrshi
  • Remove unused type aliases (#17396) @PointKernel
  • Remove some cudf._lib.strings files in favor of inlining pylibcudf (#17394) @mroeschke
  • Update xxhash_64 to utilize the cuco equivalent implementation (#17393) @PointKernel
  • Change indices for dictionary column to signed integer type (#17390) @davidwendt
  • Return categorical values in to_numpy/to_cupy (#17388) @mroeschke
  • Forward-merge branch-24.12 to branch-25.02 (#17379) @bdice
  • Remove unused IO utilities from cudf python (#17374) @Matt711
  • Remove cudf._lib.datetime in favor of inlining pylibcudf (#17372) @mroeschke
  • Remove cudf._lib.join in favor of inlining pylibcudf (#17371) @mroeschke
  • Remove cudf._lib.merge in favor of inlining pylibcudf (#17370) @mroeschke
  • Remove cudf._lib.partitioning in favor of inlining pylibcudf (#17369) @mroeschke
  • Remove cudf._lib.reshape in favor of inlining pylibcudf (#17368) @mroeschke
  • Remove cudf._lib.timezone in favor of inlining pylibcudf (#17366) @mroeschke
  • Remove cudf._lib.transpose in favor of inlining pylibcudf (#17365) @mroeschke
  • Move make_strings_column benchmark to nvbench (#17340) @davidwendt
  • Improve strings contains/find performance for smaller strings (#17330) @davidwendt
  • Use rapids-logger to generate the cudf logger (#17307) @vyasr
  • Mukernels strings (#17286) @pmattione-nvidia
  • Add write_parquet to pylibcudf (#17263) @mroeschke
  • Single-partition Dask executor for cuDF-Polars (#17262) @rjzamora
  • Add breaking change workflow trigger (#17248) @AyodeAwe
  • Precompute AST arity (#17234) @bdice
  • Update to CCCL 2.7.0-rc2. (#17233) @bdice
  • Make column_empty mask buffer creation consistent with libcudf (#16715) @mroeschke

- C++
Published by raydouglass 12 months ago

https://github.com/rapidsai/cudf - v25.02.01

🚨 Breaking Changes

  • Expose stream-ordering in scalar and avro APIs (#17766) @shrshi
  • Add seed parameter to hash_character_ngrams (#17643) @davidwendt
  • Performance improvements and simplifications for fixed size row-based rolling windows (#17623) @wence-
  • Refactor distinct hash join to handle multiple probes with the same build table (#17609) @PointKernel
  • Deprecate cudf::grouped_time_range_rolling_window (#17589) @wence-
  • Remove "legacy" Dask DataFrame support from Dask cuDF (#17558) @rjzamora
  • Return empty result for segmented_reduce if input and offsets are both empty (#17437) @davidwendt
  • Rework minhash APIs for deprecation cycle (#17421) @davidwendt
  • Change indices for dictionary column to signed integer type (#17390) @davidwendt

πŸ› Bug Fixes

  • Skip the failing connectorx polars tests (#18037) @Matt711
  • Fix 'Unexpected short subpass' exception in parquet chunked reader. (#18019) @nvdbaranec
  • Fix race check failures in shared memory groupby (#17985) @PointKernel
  • Pin ibis version in the cudf.pandas integration tests <10.0.0 (#17975) @Matt711
  • Fix the index type in the indexing operator of the span types (#17971) @vuule
  • Add missing pin (#17915) @vyasr
  • Fix third-party cudf.pandas tests (#17900) @galipremsagar
  • Fix numpy data access by making attribute private (#17890) @galipremsagar
  • Remove extra local var declaration from cudf.pandas 3rd-party integration shell script (#17886) @Matt711
  • Move isinstance_cudf_pandas to fast_slow_proxy (#17875) @galipremsagar
  • Make _Series_dtype method a property (#17854) @Matt711
  • Fix the bug in determining the heuristics for shared memory groupby (#17851) @PointKernel
  • Fix possible OOB mem access in Parquet decoder (#17841) @mhaseeb123
  • Require batches to be non-empty in multi-batch JSON reader (#17837) @shrshi
  • Fix rolling(min_periods=) with int and null data with mode.pandas_compat (#17822) @mroeschke
  • Resolve race-condition in disable_module_accelerator (#17811) @galipremsagar
  • Make Series(dtype=object) raise in mode.pandas_compat with non string data (#17804) @mroeschke
  • Disable intended disabled ORC tests (#17790) @davidwendt
  • Fix empty DataFrame construction not returning RangeIndex columns (#17784) @mroeschke
  • Fix various .str methods for pandas compatibility (#17782) @mroeschke
  • Fix count API issue about ignoring nan values (#17779) @galipremsagar
  • Add numba pinning to cudf repo (#17777) @galipremsagar
  • Allow .sort_values(na_position=) to include NaNs in mode.pandas_compatible (#17776) @mroeschke
  • allow deselecting nvcomp wheels (#17774) @jameslamb
  • Use the aligned_resource_adaptor to allocate bloom filter device buffers (#17758) @mhaseeb123
  • Avoid instantiating bloom filter query function for nested and bool types (#17753) @mhaseeb123
  • Fix DataFrame.merge(Series, how="left"/"right") on column and index not resulting in a RangeIndex (#17739) @mroeschke
  • [BUG] xfail Polars excel test (#17731) @Matt711
  • Require to implement AutoCloseable for the classes derived from HostUDFWrapper (#17727) @ttnghia
  • Remove jlowe as a java committer since he retired (#17725) @tgravescs
  • Prevent use of invalid grid sizes in ORC reader and writer (#17709) @vuule
  • Enforce schema for partial tables in multi-source multi-batch JSON reader (#17708) @shrshi
  • Compute and use the initial string offset when building nested large string cols with chunked parquet reader (#17702) @mhaseeb123
  • Fix writing of compressed ORC files with large stripe footers (#17700) @vuule
  • Fix cudf.polars sum of empty not equalling zero (#17685) @mroeschke
  • Fix formatting in logging (#17680) @vuule
  • convert all nulls to nans in a specific scenario (#17677) @galipremsagar
  • Define cudf repr methods on the Column (#17675) @mroeschke
  • Fix groupby.len with null values in cudf.polars (#17671) @mroeschke
  • Fix: DataFrameGroupBy.get_group was raising with length>1 tuples (#17653) @MarcoGorelli
  • Fix possible int overflow in compute_mixed_join_output_size (#17633) @davidwendt
  • Fix a minor potential i32 overflow in thrust::transform_exclusive_scan in PQ reader preprocessing (#17617) @mhaseeb123
  • Fix failing xgboost test in the cudf.pandas third-party integration tests (#17616) @Matt711
  • Fix dask_cudf.read_csv (#17612) @rjzamora
  • Fix memcheck error in ReplaceTest.NormalizeNansAndZerosMutable gtest (#17610) @davidwendt
  • Correctly accept a pandas.CategoricalDtype(pandas.IntervalDtype(...), ...) type (#17604) @mroeschke
  • Add ability to modify and propagate names of columns object (#17597) @galipremsagar
  • Ignore NaN correctly in .quantile (#17593) @mroeschke
  • Fix groupby argmin/max gather of sorted-order indices (#17591) @davidwendt
  • Fix ctest fail running libcudf tests in a Debug build (#17576) @davidwendt
  • Specify a version for rapids_logger dependency (#17573) @jlowe
  • Fix the ORC decoding bug for the timestamp data (#17570) @kingcrimsontianyu
  • [JNI] remove rmm argument to set rw access for fabric handles (#17553) @abellina
  • Document undefined behavior in div_rounding_up_safe (#17542) @davidwendt
  • Fix nvcc-imposed UB in constexpr functions (#17534) @vuule
  • Add anonymous namespace to libcudf test source (#17529) @davidwendt
  • Propagate failures in pandas integration tests and Skip failing tests (#17521) @Matt711
  • Fix libcudf compile error when logging is disabled (#17512) @davidwendt
  • Fix Dask-cuDF clip APIs (#17509) @rjzamora
  • Fix pylibcudf to_arrow with multiple nested data types (#17504) @mroeschke
  • Fix groupby(as_index=False).size not resetting index (#17499) @mroeschke
  • Revert "Temporarily skip tests due to dask/distributed#8953" (#17492) @Matt711
  • Workaround for a misaligned access in read_csv on some CUDA versions (#17477) @vuule
  • Fix some possible thread-id overflow calculations (#17473) @davidwendt
  • Temporarily skip tests due to dask/distributed#8953 (#17472) @wence-
  • Detect mismatches in begin and end tokens returned by JSON tokenizer FST (#17471) @shrshi
  • Support dask>=2024.11.2 in Dask cuDF (#17439) @rjzamora
  • Fix write_json failure for zero columns in table/struct (#17414) @karthikeyann
  • Fix Debug-mode failing Arrow test (#17405) @zeroshade
  • Fix all null list column with missing child column in JSON reader (#17348) @karthikeyann

📖 Documentation

  • Fix forward merge 24.12->25.02 (#18002) @raydouglass
  • Fix incorrect example in pylibcudf docs (#17912) @Matt711
  • Explicitly call out that the GPU open beta runs on a single GPU (#17872) @taureandyernv
  • Update cudf.pandas colab link in docs (#17846) @taureandyernv
  • [DOC] Make pylibcudf docs more visible (#17803) @Matt711
  • Cross-link cudf.pandas profiler documentation. (#17668) @bdice
  • Document interpreter install command for cudf.pandas (#17358) @bdice
  • add comment to Series.tolist method (#17350) @tequilayu

🚀 New Features

  • Bump polars version to <1.22 (#17771) @Matt711
  • Make more constexpr available on device for cuIO (#17746) @PointKernel
  • Add public interop functions between pylibcudf and cudf classic (#17730) @Matt711
  • Support dask_expr migration into dask.dataframe (#17704) @rjzamora
  • Make tests build without relaxed constexpr (#17691) @PointKernel
  • Set default logger level to warn (#17684) @vyasr
  • Support multithreaded reading of compressed buffers in JSON reader (#17670) @shrshi
  • Control pinned memory use with environment variables (#17657) @vuule
  • Host compression (#17656) @vuule
  • Enable text build without relying on relaxed constexpr (#17647) @PointKernel
  • Implement HOST_UDF aggregation for reduction and segmented reduction (#17645) @ttnghia
  • Add JSON reader options structs to pylibcudf (#17614) @Matt711
  • Refactor distinct hash join to handle multiple probes with the same build table (#17609) @PointKernel
  • Add JSON Writer options classes to pylibcudf (#17606) @Matt711
  • Add ORC reader options structs to pylibcudf (#17601) @Matt711
  • Add Avro Reader options classes to pylibcudf (#17599) @Matt711
  • Enable binaryop build without relying on relaxed constexpr (#17598) @PointKernel
  • Measure the number of Parquet row groups filtered by predicate pushdown (#17594) @mhaseeb123
  • Implement HOST_UDF aggregation for groupby (#17592) @ttnghia
  • Plumb pylibcudf.io.parquet options classes through cudf python (#17506) @Matt711
  • Add partition-wise Select support to cuDF-Polars (#17495) @rjzamora
  • Add multi-partition Scan support to cuDF-Polars (#17494) @rjzamora
  • Migrate cudf::io::merge_row_group_metadata to pylibcudf (#17491) @Matt711
  • Add Parquet Reader options classes to pylibcudf (#17464) @Matt711
  • Add multi-partition DataFrameScan support to cuDF-Polars (#17441) @rjzamora
  • Return empty result for segmented_reduce if input and offsets are both empty (#17437) @davidwendt
  • Abstract polars function expression nodes to ensure they are serializable (#17418) @pentschev
  • Add CSV Reader options classes to pylibcudf (#17412) @Matt711
  • Add support for pylibcudf.DataType serialization (#17352) @pentschev
  • Enable rounding for Decimal32 and Decimal64 in cuDF (#17332) @a-hirota
  • Remove upper bounds on cuda-python to allow 12.6.2 and 11.8.5 (#17326) @bdice
  • Expose stream-ordering to groupby APIs (#17324) @shrshi
  • Migrate ORC Writer to pylibcudf (#17310) @Matt711
  • Support reading bloom filters from Parquet files and filter row groups using them (#17289) @mhaseeb123

πŸ› οΈ Improvements

  • Update to nvcomp 4.2.0.11 (#18042) @bdice
  • Remove pandas backend from cudf.pandas - ibis integration tests (#17945) @Matt711
  • Revert CUDA 12.8 shared workflow branch changes (#17879) @vyasr
  • Remove predicate param from DataFrameScan IR (#17852) @Matt711
  • Remove cudf.Scalar from scatter APIs (#17847) @mroeschke
  • Remove cudf.Scalar from interval_range (#17844) @mroeschke
  • Add verify-codeowners hook (#17840) @KyleFromNVIDIA
  • Build and test with CUDA 12.8.0 (#17834) @bdice
  • Increase timeout for recently added test (#17829) @galipremsagar
  • Apply ruff everywhere (notebooks and scripts) (#17820) @bdice
  • Fix pre-commit.ci failures (#17819) @bdice
  • Remove incorrect calls to set architectures (#17813) @vyasr
  • Fix typo in exception raised when attempting to convert a string column to cupy (#17800) @dagardner-nv
  • Add support for pyarrow-19 (#17794) @galipremsagar
  • increase parallelism in nightly builds (#17792) @jameslamb
  • Reduce libcudf memcheck tests output (#17791) @davidwendt
  • Make cudf build with latest CCCL (#17788) @miscco
  • Introduce some more rolling window benchmarks (#17787) @wence-
  • Add shellcheck to pre-commit and fix warnings (#17778) @gforsyth
  • Improve parquet reader very-long string performance (#17773) @pmattione-nvidia
  • Update how to manage host UDF instance (#17770) @res-life
  • Add getInts api for HostMemoryBuffer and UnsafeMemoryAccessor (#17767) @liurenjie1024
  • Expose stream-ordering in scalar and avro APIs (#17766) @shrshi
  • Standardize methods used from cudf.core._internals (#17765) @mroeschke
  • Implement string join in cudf-polars (#17755) @wence-
  • Deprecate dataframe protocol (#17736) @vyasr
  • Add parquet reader long row test (#17735) @pmattione-nvidia
  • Update kvikio call due to upstream changes (#17733) @kingcrimsontianyu
  • Delay setting MultiIndex.level/codes until needed (#17728) @mroeschke
  • Bounding pool size in multi-batch JSON reader (#17724) @shrshi
  • Use GCC 13 in CUDA 12 conda builds. (#17721) @bdice
  • Update minimal sphinx theme version so that we can use parallel doc builds (#17719) @vyasr
  • Add more aggregation methods in pylibcudf (#17717) @mroeschke
  • Make cudf._lib.string_udf work with pylibcudf Columns instead of cudf._lib Columns (#17715) @mroeschke
  • Add special orc test data: timestamp interspersed with null values (#17713) @kingcrimsontianyu
  • Add pylibcudf.null_mask.null_count (#17711) @mroeschke
  • Ensure pyarrow.Scalar to pylibcudf.Scalar is cached (#17707) @mroeschke
  • Adapt cudf numba config for numba 0.61 removal (#17705) @mroeschke
  • Remove cudf._lib.scalar in favor of pylibcudf (#17701) @mroeschke
  • Fix parquet reader list bug (#17699) @pmattione-nvidia
  • Migrated Dynamic AST Expression Trees in Benchmarks and Tests to use AST Tree (#17697) @lamarrr
  • Skip polars test that can generate timezones that chrono_tz doesn't know (#17694) @wence-
  • Use 64-bit offsets only if the current strings column output chunk size exceeds threshold (#17693) @mhaseeb123
  • Use latest ci-conda images (#17690) @bdice
  • Add multi-source reading to JSON reader benchmarks (#17688) @shrshi
  • Convert cudf.Scalar usage to pylibcudf and pyarrow usage (#17686) @mroeschke
  • remove find_package(Python) in libcudf build (#17683) @jameslamb
  • Fix build metrics report format with long placehold filenames (#17679) @davidwendt
  • Use rapids-cmake for the logger (#17674) @vyasr
  • Java Parquet reads via multiple host buffers (#17673) @jlowe
  • Remove cudf._libs.types.pyx (#17665) @mroeschke
  • Add support for Groupby.cumprod (#17661) @galipremsagar
  • Implement .dt.total_seconds (#17659) @galipremsagar
  • Avoid shallow copies in groupby methods (#17646) @mroeschke
  • Avoid double MultiIndex factorization in groupby index result (#17644) @mroeschke
  • Add seed parameter to hash_character_ngrams (#17643) @davidwendt
  • Fix possible overflow in WriteCoalescingCallbackWrapper::TearDown (#17642) @davidwendt
  • Remove pragma GCC diagnostic from source files (#17637) @davidwendt
  • Move unnecessary utilities from cudf._lib.scalar (#17636) @mroeschke
  • Support compression= in DataFrame.to_json (#17634) @mroeschke
  • Bump Polars version to <1.18 (#17632) @Matt711
  • Add public APIs to Access Underlying cudf and pandas Objects from cudf.pandas Proxy Objects (#17629) @galipremsagar
  • Use Numba Config to turn on Pynvjitlink Features (#17628) @isVoid
  • Use PyNVML 12 (#17627) @jakirkham
  • Remove cudf._lib.utils in favor of python APIs (#17625) @mroeschke
  • Performance improvements and simplifications for fixed size row-based rolling windows (#17623) @wence-
  • Fix return types for MurmurHash3_x86_32 template specializations (#17622) @davidwendt
  • Clean up namespaces and improve compression-related headers (#17621) @vuule
  • Use more pylibcudf.types instead of cudf._lib.types (#17619) @mroeschke
  • Remove patch that is only needed for clang-tidy to run on test files (#17618) @vyasr
  • update telemetry actions to fluent-bit friendly style (#17615) @msarahan
  • Introduce some simple benchmarks for rolling window aggregations (#17613) @wence-
  • Bump the oldest pyarrow version to 14.0.2 in test matrix (#17611) @galipremsagar
  • Use [[nodiscard]] attribute before __device__ (#17608) @vuule
  • Use host_vector in flatten_single_pass_aggs (#17605) @vuule
  • Stop memory_resource.hpp from including itself (#17603) @vyasr
  • Replace the outdated cuco window concept with buckets (#17602) @PointKernel
  • Check if nightlies have succeeded recently enough (#17596) @vyasr
  • Deprecate cudf::grouped_time_range_rolling_window (#17589) @wence-
  • A couple of fixes in rapids-logger usage (#17588) @vyasr
  • Simplify expression transformer in Parquet predicate pushdown with ast::tree (#17587) @mhaseeb123
  • Remove unused functionality in cudf._lib.utils.pyx (#17586) @mroeschke
  • Use cuda-python cuda.bindings import names. (#17585) @bdice
  • Use no-sync copy for fixed-width types in cudf::concatenate (#17584) @davidwendt
  • Remove cudf._lib.groupby in favor of inlining pylibcudf (#17582) @mroeschke
  • Remove unused code of json schema in JSON reader (#17581) @karthikeyann
  • Expose Scalar's constructor and Scalar#getScalarHandle() to public (#17580) @ttnghia
  • Allow large strings in nvtext benchmarks (#17579) @davidwendt
  • Remove cudf._lib.reduce in favor of inlining pylibcudf (#17574) @mroeschke
  • Use batched memcpy when writing ORC statistics (#17572) @vuule
  • Allow large strings in nvbench strings benchmarks (#17571) @davidwendt
  • Update version references in workflow (#17568) @AyodeAwe
  • Enable all json reader options in pylibcudf read_json (#17563) @karthikeyann
  • Remove cudf._lib.parquet in favor of inlining pylibcudf (#17562) @mroeschke
  • Fix CMake format in cudf/_lib/CMakeLists.txt (#17559) @mroeschke
  • Remove "legacy" Dask DataFrame support from Dask cuDF (#17558) @rjzamora
  • Replace direct cudaMemcpyAsync calls with utility functions (within /include) (#17557) @vuule
  • Remove cudf._lib.interop in favor of inlining pylibcudf (#17555) @mroeschke
  • gate telemetry dispatch calls on TELEMETRY_ENABLED env var (#17551) @msarahan
  • Replace direct cudaMemcpyAsync calls with utility functions (within /src) (#17550) @vuule
  • Remove unused BufferArrayFromVector (#17549) @Matt711
  • Move cudf._lib.copying to cudf.core._internals (#17548) @mroeschke
  • Update cuda-python lower bounds to 12.6.2 / 11.8.5 (#17547) @bdice
  • Fix typos, rename types, and add null_probability benchmark axis for distinct (#17546) @PointKernel
  • Mark more constexpr functions as device-available (#17545) @vyasr
  • Use cooperative-groups instead of cub warp-reduce for strings contains (#17540) @davidwendt
  • Remove cudf._lib.nvtext in favor of inlining pylibcudf (#17535) @mroeschke
  • Add XXHash_32 hasher (#17533) @PointKernel
  • Remove unused masked keyword in column_empty (#17530) @mroeschke
  • Remove Thrust patch in favor of CMake definition for Thrust 32-bit offset types. (#17527) @bdice
  • [JNI] Enables fabric handles for CUDA async memory pools (#17526) @abellina
  • Force Thrust to use 32-bit offset type. (#17523) @bdice
  • Replace cudf::detail::copy_if logic with thrust::copy_if and gather (#17520) @davidwendt
  • Replaces uses of cudf._lib.Column.from_unique_ptr with pylibcudf.Column.from_libcudf (#17517) @Matt711
  • Move cudf._lib.aggregation to cudf.core._internals (#17516) @mroeschke
  • Migrate copy_column and Column.from_scalar to pylibcudf (#17513) @Matt711
  • Remove cudf._lib.transform in favor of inlining pylibcudf (#17505) @mroeschke
  • Remove cudf._lib.string.convert/split in favor of inlining pylibcudf (#17496) @mroeschke
  • Move cudf._lib.sort to cudf.core._internals (#17488) @mroeschke
  • Remove cudf._lib.csv in favor of inlining pylibcudf (#17485) @mroeschke
  • Update PyTorch to >=2.4.0 to get fix for CUDA array interface bug, and drop CUDA 11 PyTorch tests. (#17475) @bdice
  • Remove cudf._lib.binops in favor of inlining pylibcudf (#17468) @mroeschke
  • Remove cudf._lib.orc in favor of inlining pylibcudf (#17466) @mroeschke
  • skip most CI on devcontainer-only changes (#17465) @jameslamb
  • Set build type for all examples (#17463) @vyasr
  • Update the hook versions in pre-commit (#17462) @wence-
  • Remove cudf._lib.string_casting in favor of inlining pylibcudf (#17460) @mroeschke
  • Remove cudf._lib.filling in favor of inlining pylibcudf (#17459) @mroeschke
  • Update MurmurHash3_x64_128 to use the cuco equivalent implementation (#17457) @PointKernel
  • Move cudf._lib.stream_compaction to cudf.core._internals (#17456) @mroeschke
  • Clean up xxhash_64 implementations (#17455) @PointKernel
  • Update Hadoop dependency in Java pom (#17454) @jlowe
  • Adapt to rmm logger changes (#17451) @vyasr
  • Require approval to run CI on draft PRs (#17450) @bdice
  • Expose stream-ordering in nvtext API (#17446) @shrshi
  • Use exec_policy_nosync in write_json (#17445) @karthikeyann
  • Remove cudf._lib.json in favor of inlining pylibcudf (#17443) @mroeschke
  • Remove cudf._lib.null_mask in favor of inlining pylibcudf (#17440) @mroeschke
  • Expose stream-ordering in replace API (#17436) @shrshi
  • Expose stream-ordering in copying APIs (#17435) @shrshi
  • Expose stream-ordering in column view APIs (#17434) @shrshi
  • Apply clang-tidy autofixes from new rules (#17431) @vyasr
  • Remove cudf._lib.round in favor of inlining pylibcudf (#17430) @mroeschke
  • Update MurmurHash3_x86_32 to use the cuco equivalent implementation (#17429) @PointKernel
  • Remove cudf._lib.replace in favor of inlining pylibcudf (#17428) @mroeschke
  • Remove nvtx/ranges.hpp include from cuda.cuh (#17427) @davidwendt
  • Remove the unused detail int_fastdiv.h header (#17426) @PointKernel
  • Remove cudf._lib.lists in favor of inlining pylibcudf (#17425) @mroeschke
  • Remove cudf._lib.quantile (#17424) @mroeschke
  • Remove cudf._lib.rolling in favor of inlining pylibcudf (#17423) @mroeschke
  • Avoid converting Decimal32/Decimal64 in to_arrow and from_arrow APIs (#17422) @zeroshade
  • Rework minhash APIs for deprecation cycle (#17421) @davidwendt
  • Use thread_index_type in binary-ops jit kernel.cu (#17420) @davidwendt
  • Change binops for-each kernel to thrust::for_each_n (#17419) @davidwendt
  • Move cudf._lib.search to cudf.core._internals (#17411) @mroeschke
  • Use grid_1d utilities in copy_range.cuh (#17409) @davidwendt
  • Remove cudf._lib.text in favor of inlining pylibcudf (#17408) @mroeschke
  • Run clang-tidy checks in PR CI (#17407) @bdice
  • Update strings/text source to use grid_1d for thread/block/stride calculations (#17404) @davidwendt
  • Expose stream-ordering to strings attribute APIs (#17398) @shrshi
  • Expose stream-ordering to interop APIs (#17397) @shrshi
  • Remove unused type aliases (#17396) @PointKernel
  • Remove some cudf._lib.strings files in favor of inlining pylibcudf (#17394) @mroeschke
  • Update xxhash_64 to utilize the cuco equivalent implementation (#17393) @PointKernel
  • Change indices for dictionary column to signed integer type (#17390) @davidwendt
  • Return categorical values in to_numpy/to_cupy (#17388) @mroeschke
  • Forward-merge branch-24.12 to branch-25.02 (#17379) @bdice
  • Remove unused IO utilities from cudf python (#17374) @Matt711
  • Remove cudf._lib.datetime in favor of inlining pylibcudf (#17372) @mroeschke
  • Remove cudf._lib.join in favor of inlining pylibcudf (#17371) @mroeschke
  • Remove cudf._lib.merge in favor of inlining pylibcudf (#17370) @mroeschke
  • Remove cudf._lib.partitioning in favor of inlining pylibcudf (#17369) @mroeschke
  • Remove cudf._lib.reshape in favor of inlining pylibcudf (#17368) @mroeschke
  • Remove cudf._lib.timezone in favor of inlining pylibcudf (#17366) @mroeschke
  • Remove cudf._lib.transpose in favor of inlining pylibcudf (#17365) @mroeschke
  • Move make_strings_column benchmark to nvbench (#17340) @davidwendt
  • Improve strings contains/find performance for smaller strings (#17330) @davidwendt
  • Use rapids-logger to generate the cudf logger (#17307) @vyasr
  • Mukernels strings (#17286) @pmattione-nvidia
  • Add write_parquet to pylibcudf (#17263) @mroeschke
  • Single-partition Dask executor for cuDF-Polars (#17262) @rjzamora
  • Add breaking change workflow trigger (#17248) @AyodeAwe
  • Precompute AST arity (#17234) @bdice
  • Update to CCCL 2.7.0-rc2. (#17233) @bdice
  • Make column_empty mask buffer creation consistent with libcudf (#16715) @mroeschke

- C++
Published by AyodeAwe 12 months ago

https://github.com/rapidsai/cudf - v24.12.00

🚨 Breaking Changes

  • Fix reading Parquet string cols when nrows and input_pass_limit > 0 (#17321) @mhaseeb123
  • prefer wheel-provided libcudf.so in load_library(), use RTLD_LOCAL (#17316) @jameslamb
  • Deprecate single component extraction methods in libcudf (#17221) @Matt711
  • Move detail header floating_conversion.hpp to detail subdirectory (#17209) @davidwendt
  • Refactor Dask cuDF legacy code (#17205) @rjzamora
  • Make HostMemoryBuffer call into the DefaultHostMemoryAllocator (#17204) @revans2
  • Remove java reservation (#17189) @revans2
  • Separate evaluation logic from IR objects in cudf-polars (#17175) @rjzamora
  • Upgrade to polars 1.11 in cudf-polars (#17154) @wence-
  • Remove the additional host register calls initially intended for performance improvement on Grace Hopper (#17092) @kingcrimsontianyu
  • Correctly set is_device_accesible when creating host_spans from other container/span types (#17079) @vuule
  • Unify treatment of Expr and IR nodes in cudf-polars DSL (#17016) @wence-
  • Deprecate support for directly accessing logger (#16964) @vyasr
  • Made cudftestutil header-only and removed GTest dependency (#16839) @lamarrr

🐛 Bug Fixes

  • Turn off cudf.pandas 3rd party integrations tests for 24.12 (#17500) @Matt711
  • Ignore errors when testing glibc versions (#17389) @vyasr
  • Adapt to KvikIO API change in the compatibility mode (#17377) @kingcrimsontianyu
  • Support pivot with index or column arguments as lists (#17373) @mroeschke
  • Deselect failing polars tests (#17362) @pentschev
  • Fix integer overflow in compiled binaryop (#17354) @wence-
  • Update cmake to 3.28.6 in JNI Dockerfile (#17342) @jlowe
  • fix library-loading issues in editable installs (#17338) @jameslamb
  • Bug fix: restrict lines=True to JSON format in Kafka read_gdf method (#17333) @a-hirota
  • Fix various issues with replace API and add support in datetime and timedelta columns (#17331) @galipremsagar
  • Do not exclude nanoarrow and flatbuffers from installation if statically linked (#17322) @hyperbolic2346
  • Fix reading Parquet string cols when nrows and input_pass_limit > 0 (#17321) @mhaseeb123
  • Remove another reference to FindcuFile (#17315) @KyleFromNVIDIA
  • Fix reading of single-row unterminated CSV files (#17305) @vuule
  • Fixed lifetime issue in ast transform tests (#17292) @lamarrr
  • Switch to using TaskSpec (#17285) @galipremsagar
  • Fix datatype ctor call in JSONTEST (#17273) @davidwendt
  • Expose delimiter character in JSON reader options to JSON reader APIs (#17266) @shrshi
  • Fix extract-datetime deprecation warning in ndsh benchmark (#17254) @davidwendt
  • Disallow cuda-python 12.6.1 and 11.8.4 (#17253) @bdice
  • Wrap custom iterator result (#17251) @galipremsagar
  • Fix binop with LHS numpy datetimelike scalar (#17226) @mroeschke
  • Fix Dataframe.__setitem__ slow-downs (#17222) @galipremsagar
  • Fix groupby.get_group with length-1 tuple with list-like grouper (#17216) @mroeschke
  • Fix discoverability of submodules inside pd.util (#17215) @galipremsagar
  • Fix Schema.Builder does not propagate precision value to Builder instance (#17214) @ttnghia
  • Mark column chunks in a PQ reader pass as large strings when the cumulative offsets exceeds the large strings threshold. (#17207) @mhaseeb123
  • [BUG] Replace repo_token with github_token in Auto Assign PR GHA (#17203) @Matt711
  • Remove unsanitized nulls from input strings columns in reduction gtests (#17202) @davidwendt
  • Fix to_parquet append behavior with global metadata file (#17198) @rjzamora
  • Check num_children() == 0 in Column.from_column_view (#17193) @cwharris
  • Fix host-to-device copy missing sync in strings/duration convert (#17149) @davidwendt
  • Add JNI Support for Multi-line Delimiters and Include Test (#17139) @SurajAralihalli
  • Ignore loud dask warnings about legacy dataframe implementation (#17137) @galipremsagar
  • Fix the GDS read/write segfault/bus error when the cuFile policy is set to GDS or ALWAYS (#17122) @kingcrimsontianyu
  • Fix DataFrame._from_arrays and introduce validations (#17112) @galipremsagar
  • [Bug] Fix Arrow-FS parquet reader for larger files (#17099) @rjzamora
  • Fix bug in recovering invalid lines in JSONL inputs (#17098) @shrshi
  • Reenable huge pages for arrow host copying (#17097) @vyasr
  • Correctly set is_device_accesible when creating host_spans from other container/span types (#17079) @vuule
  • Fix ORC reader when using device_read_async while the destination device buffers are not ready (#17074) @ttnghia
  • Fix regex handling of fixed quantifier with 0 range (#17067) @davidwendt
  • Limit the number of keys to calculate column sizes and page starts in PQ reader to 1B (#17059) @mhaseeb123
  • Adding assertion to check for regular JSON inputs of size greater than INT_MAX bytes (#17057) @shrshi
  • bug fix: use self.ck_consumer in poll method of kafka.py to align with __init__ (#17044) @a-hirota
  • Disable kvikio remote I/O to avoid openssl dependencies in JNI build (#17026) @pxLi
  • Fix host_span constructor to correctly copy is_device_accessible (#17020) @vuule
  • Add pinning for pyarrow in wheels (#17018) @vyasr
  • Use std::optional for host types (#17015) @robertmaynard
  • Fix write_json to handle empty string column (#16995) @karthikeyann
  • Restore export of nvcomp outside of wheel builds (#16988) @KyleFromNVIDIA
  • Allow melt(var_name=) to be a falsy label (#16981) @mroeschke
  • Fix astype from tz-aware type to tz-aware type (#16980) @mroeschke
  • Use libcudf wheel from PR rather than nightly for polars-polars CI test job (#16975) @brandon-b-miller
  • Fix order-preservation in pandas-compat unsorted groupby (#16942) @wence-
  • Fix cudf::strings::findall error with empty input (#16928) @davidwendt
  • Fix JsonLargeReaderTest.MultiBatch use of LIBCUDF_JSON_BATCH_SIZE env var (#16927) @davidwendt
  • Parse newline as whitespace character while tokenizing JSONL inputs with non-newline delimiter (#16923) @shrshi
  • Respect groupby.nunique(dropna=False) (#16921) @mroeschke
  • Update all rmm imports to use pylibrmm/librmm (#16913) @Matt711
  • Fix order-preservation in cudf-polars groupby (#16907) @wence-
  • Add a shortcut for when the input clusters are all empty for the tdigest merge (#16897) @jihoonson
  • Properly handle the mapped and registered regions in memory_mapped_source (#16865) @vuule
  • Fix performance regression for generate_character_ngrams (#16849) @davidwendt
  • Fix regex parsing logic handling of nested quantifiers (#16798) @davidwendt
  • Compute whole column variance using numerically stable approach (#16448) @wence-

📖 Documentation

  • Add documentation for low memory readers (#17314) @btepera
  • Fix the example in documentation for get_dremel_data() (#17242) @mhaseeb123
  • Fix some documentation rendering for pylibcudf (#17217) @mroeschke
  • Move detail header floating_conversion.hpp to detail subdirectory (#17209) @davidwendt
  • Add TokenizeVocabulary to api docs (#17208) @davidwendt
  • Add jaccard_index to generated cuDF docs (#17199) @davidwendt
  • [no ci] Add empty-columns section to the libcudf developer guide (#17183) @davidwendt
  • Add 2-cpp approvers text to contributing guide no ci @davidwendt
  • Changing developer guide int_64_t to int64_t (#17130) @hyperbolic2346
  • docs: change 'CSV' to 'csv' in python/custreamz/README.md to match kafka.py (#17041) @a-hirota
  • [DOC] Document limitation using cudf.pandas proxy arrays (#16955) @Matt711
  • [DOC] Document environment variable for failing on fallback in cudf.pandas (#16932) @Matt711

🚀 New Features

  • Add version config (#17312) @vyasr
  • Java JNI for Multiple contains (#17281) @res-life
  • Add cudf::calendrical_month_sequence to pylibcudf (#17277) @Matt711
  • Raise errors on specific types of fallback in cudf.pandas (#17268) @Matt711
  • Add catboost to the third-party integration tests (#17267) @Matt711
  • Add type stubs for pylibcudf (#17258) @wence-
  • Use pylibcudf contiguous split APIs in cudf python (#17246) @Matt711
  • Upgrade nvcomp to 4.1.0.6 (#17201) @bdice
  • Added Arrow Interop Benchmarks (#17194) @lamarrr
  • Rewrite Java API Table.readJSON to return the output from libcudf read_json directly (#17180) @ttnghia
  • Support storing precision of decimal types in Schema class (#17176) @ttnghia
  • Migrate CSV writer to pylibcudf (#17163) @Matt711
  • Add compute_shared_memory_aggs used by shared memory groupby (#17162) @PointKernel
  • Added ast tree to simplify expression lifetime management (#17156) @lamarrr
  • Add compute_mapping_indices used by shared memory groupby (#17147) @PointKernel
  • Add remaining datetime APIs to pylibcudf (#17143) @Matt711
  • Added strings AST vs BINARY_OP benchmarks (#17128) @lamarrr
  • Use libcudf_exception_handler throughout pylibcudf.libcudf (#17109) @brandon-b-miller
  • Include timezone file path in error message (#17102) @bdice
  • Migrate NVText Byte Pair Encoding APIs to pylibcudf (#17101) @Matt711
  • Migrate NVText Tokenizing APIs to pylibcudf (#17100) @Matt711
  • Migrate NVtext subword tokenizing APIs to pylibcudf (#17096) @Matt711
  • Migrate NVText Stemming APIs to pylibcudf (#17085) @Matt711
  • Migrate NVText Replacing APIs to pylibcudf (#17084) @Matt711
  • Add IWYU to CI (#17078) @vyasr
  • cudf-polars string/numeric casting (#17076) @brandon-b-miller
  • Migrate NVText Normalizing APIs to Pylibcudf (#17072) @Matt711
  • Migrate remaining nvtext NGrams APIs to pylibcudf (#17070) @Matt711
  • Add profilers to CUDA 12 conda devcontainers (#17066) @vyasr
  • Add conda recipe for cudf-polars (#17037) @bdice
  • Implement batch construction for strings columns (#17035) @ttnghia
  • Add device aggregators used by shared memory groupby (#17031) @PointKernel
  • Add optional column_order in JSON reader (#17029) @karthikeyann
  • Migrate Min Hashing APIs to pylibcudf (#17021) @Matt711
  • Reorganize cudf_polars expression code (#17014) @brandon-b-miller
  • Migrate nvtext jaccard API to pylibcudf (#17007) @Matt711
  • Migrate nvtext generate_ngrams APIs to pylibcudf (#17006) @Matt711
  • Control whether a file data source memory-maps the file with an environment variable (#17004) @vuule
  • Switched BINARY_OP Benchmarks from GoogleBench to NVBench (#16963) @lamarrr
  • [FEA] Report all unsupported operations for a query in cudf.polars (#16960) @Matt711
  • [FEA] Migrate nvtext/edit_distance APIs to pylibcudf (#16957) @Matt711
  • Switched AST benchmarks from GoogleBench to NVBench (#16952) @lamarrr
  • Extend device_scalar to optionally use pinned bounce buffer (#16947) @vuule
  • Implement cudf-polars chunked parquet reading (#16944) @brandon-b-miller
  • Expose streams in public round APIs (#16925) @Matt711
  • add telemetry setup to test (#16924) @msarahan
  • Add cudf::strings::contains_multiple (#16900) @davidwendt
  • Made cudftestutil header-only and removed GTest dependency (#16839) @lamarrr
  • Add an example to demonstrate multithreaded read_parquet pipelines (#16828) @mhaseeb123
  • Implement extract_datetime_component in libcudf/pylibcudf (#16776) @brandon-b-miller
  • Add cudf::strings::find_re API (#16742) @davidwendt
  • Migrate hashing operations to pylibcudf (#15418) @brandon-b-miller

🛠️ Improvements

  • Simplify serialization protocols (#17552) @vyasr
  • Add pynvml as a dependency for dask-cudf (#17386) @pentschev
  • Enable unified memory by default in cudf_polars (#17375) @galipremsagar
  • Support polars 1.14 (#17355) @wence-
  • Remove cudf._lib.quantiles in favor of inlining pylibcudf (#17347) @mroeschke
  • Remove cudf._lib.labeling in favor of inlining pylibcudf (#17346) @mroeschke
  • Remove cudf._lib.hash in favor of inlining pylibcudf (#17345) @mroeschke
  • Remove cudf._lib.concat in favor of inlining pylibcudf (#17344) @mroeschke
  • Extract GPUEngine config options at translation time (#17339) @rjzamora
  • Update java datetime APIs to match CUDF. (#17329) @revans2
  • Move strings url_decode benchmarks to nvbench (#17328) @davidwendt
  • Move strings translate benchmarks to nvbench (#17325) @davidwendt
  • Writing compressed output using JSON writer (#17323) @shrshi
  • Test the full matrix for polars and dask wheels on nightlies (#17320) @vyasr
  • Remove cudf._lib.avro in favor of inlining pylibcudf (#17319) @mroeschke
  • Move cudf._lib.unary to cudf.core._internals (#17318) @mroeschke
  • prefer wheel-provided libcudf.so in load_library(), use RTLD_LOCAL (#17316) @jameslamb
  • Clean up misc, unneeded pylibcudf.libcudf in cudf._lib (#17309) @mroeschke
  • Exclude nanoarrow and flatbuffers from installation (#17308) @vyasr
  • Update CI jobs to include Polars in nightlies and improve IWYU (#17306) @vyasr
  • Move strings repeat benchmarks to nvbench (#17304) @davidwendt
  • Fix synchronization bug in bool parquet mukernels (#17302) @pmattione-nvidia
  • Move strings replace benchmarks to nvbench (#17301) @davidwendt
  • Support polars 1.13 (#17299) @wence-
  • Replace FindcuFile with upstream FindCUDAToolkit support (#17298) @KyleFromNVIDIA
  • Expose stream-ordering in public transpose API (#17294) @shrshi
  • Replace workaround of JNI build with CUDF_KVIKIO_REMOTE_IO=OFF (#17293) @pxLi
  • cmake option: CUDF_KVIKIO_REMOTE_IO (#17291) @madsbk
  • Use more pylibcudf Python enums in cudf._lib (#17288) @mroeschke
  • Use pylibcudf enums in cudf Python quantile (#17287) @mroeschke
  • enforce wheel size limits, README formatting in CI (#17284) @jameslamb
  • Use numba-cuda<0.0.18 (#17280) @gmarkall
  • Add compute_column_expression to pylibcudf for transform.compute_column (#17279) @mroeschke
  • Optimize distinct inner join to use set find instead of retrieve (#17278) @PointKernel
  • remove WheelHelpers.cmake (#17276) @jameslamb
  • Plumb pylibcudf datetime APIs through cudf python (#17275) @Matt711
  • Follow up making Python tests more deterministic (#17272) @mroeschke
  • Use pylibcudf.search APIs in cudf python (#17271) @Matt711
  • Use pylibcudf.strings.convert.convert_integers.is_integer in cudf python (#17270) @Matt711
  • Move strings filter benchmarks to nvbench (#17269) @davidwendt
  • Make constructor of DeviceMemoryBufferView public (#17265) @liurenjie1024
  • Put a ceiling on cuda-python (#17264) @jameslamb
  • Always prefer device_reads and device_writes when kvikIO is enabled (#17260) @vuule
  • Expose streams in public quantile APIs (#17257) @shrshi
  • Add support for pyarrow-18 (#17256) @galipremsagar
  • Move strings/numeric convert benchmarks to nvbench (#17255) @davidwendt
  • Add new dask_cudf.read_parquet API (#17250) @rjzamora
  • Add read_parquet_metadata to pylibcudf (#17245) @mroeschke
  • Search for kvikio with lowercase (#17243) @vyasr
  • KvikIO shared library (#17239) @madsbk
  • Use more pylibcudf.io.types enums in cudf._libs (#17237) @mroeschke
  • Expose mixed and conditional joins in pylibcudf (#17235) @wence-
  • Add io.text APIs to pylibcudf (#17232) @mroeschke
  • Add num_iterations axis to the multi-threaded Parquet benchmarks (#17231) @vuule
  • Move strings to date/time types benchmarks to nvbench (#17229) @davidwendt
  • Support for polars 1.12 in cudf-polars (#17227) @wence-
  • Allow generating large strings in benchmarks (#17224) @davidwendt
  • Refactor gather/scatter benchmarks for strings (#17223) @davidwendt
  • Deprecate single component extraction methods in libcudf (#17221) @Matt711
  • Remove nvtext::load_vocabulary from pylibcudf (#17220) @Matt711
  • Benchmarking JSON reader for compressed inputs (#17219) @shrshi
  • Expose stream-ordering in partitioning API (#17213) @shrshi
  • Move strings::concatenate benchmark to nvbench (#17211) @davidwendt
  • Expose stream-ordering in subword tokenizer API (#17206) @shrshi
  • Refactor Dask cuDF legacy code (#17205) @rjzamora
  • Make HostMemoryBuffer call into the DefaultHostMemoryAllocator (#17204) @revans2
  • Unified binary_ops and ast benchmarks parameter names (#17200) @lamarrr
  • Add in new java API for raw host memory allocation (#17197) @revans2
  • Remove java reservation (#17189) @revans2
  • Fixed unused attribute compilation error for GCC 13 (#17188) @lamarrr
  • Change default KvikIO parameters in cuDF: set the thread pool size to 4, and compatibility mode to ON (#17185) @kingcrimsontianyu
  • Use make_device_uvector instead of cudaMemcpyAsync in inplace_bitmask_binop (#17181) @davidwendt
  • Make ai.rapids.cudf.HostMemoryBuffer#copyFromStream public. (#17179) @liurenjie1024
  • Separate evaluation logic from IR objects in cudf-polars (#17175) @rjzamora
  • Move nvtext ngrams benchmarks to nvbench (#17173) @davidwendt
  • Remove includes suggested by include-what-you-use (#17170) @vyasr
  • Reading multi-source compressed JSONL files (#17161) @shrshi
  • Process parquet bools with microkernels (#17157) @pmattione-nvidia
  • Upgrade to polars 1.11 in cudf-polars (#17154) @wence-
  • Deprecate current libcudf nvtext minhash functions (#17152) @davidwendt
  • Remove unused variable in internal merge_tdigests utility (#17151) @davidwendt
  • Use the full ref name of rmm.DeviceBuffer in the sphinx config file (#17150) @Matt711
  • Move segmented_gather function from the copying module to the lists module (#17148) @Matt711
  • Use async execution policy for true_if (#17146) @PointKernel
  • Add conversion from cudf-polars expressions to libcudf ast for parquet filters (#17141) @wence-
  • devcontainer: replace VAULT_HOST with AWS_ROLE_ARN (#17134) @jjacobelli
  • Replace direct cudaMemcpyAsync calls with utility functions (limited to cudf::io) (#17132) @vuule
  • use rapids-generate-pip-constraints to pin to oldest dependencies in CI (#17131) @jameslamb
  • Set the default number of threads in KvikIO thread pool to 8 (#17126) @kingcrimsontianyu
  • Fix clang-tidy violations for span.hpp and hostdevice_vector.hpp (#17124) @davidwendt
  • Disable the Parquet reader's wide lists tables GTest by default (#17120) @mhaseeb123
  • Add compile time check to ensure the counting_iterator type in counting_transform_iterator fits in size_type (#17118) @mhaseeb123
  • Minor I/O code quality improvements (#17105) @kingcrimsontianyu
  • Remove the additional host register calls initially intended for performance improvement on Grace Hopper (#17092) @kingcrimsontianyu
  • Split hash-based groupby into multiple smaller files to reduce build time (#17089) @PointKernel
  • build wheels without build isolation (#17088) @jameslamb
  • Polars: DataFrame Serialization (#17062) @madsbk
  • Remove unused hash helper functions (#17056) @PointKernel
  • Add to_dlpack/from_dlpack APIs to pylibcudf (#17055) @mroeschke
  • Move flatten_single_pass_aggs to its own TU (#17053) @PointKernel
  • Replace deprecated cuco APIs with updated versions (#17052) @PointKernel
  • Refactor ORC dictionary encoding to migrate to the new cuco::static_map (#17049) @mhaseeb123
  • Move pylibcudf/libcudf/wrappers/decimals to pylibcudf/libcudf/fixed_point (#17048) @mroeschke
  • make conda installs in CI stricter (part 2) (#17042) @jameslamb
  • Use managed memory for NDSH benchmarks (#17039) @karthikeyann
  • Clean up hash-groupby var_hash_functor (#17034) @PointKernel
  • Add json APIs to pylibcudf (#17025) @mroeschke
  • Add string.replace_re APIs to pylibcudf (#17023) @mroeschke
  • Replace old host tree algorithm with new algorithm in JSON reader (#17019) @karthikeyann
  • Unify treatment of Expr and IR nodes in cudf-polars DSL (#17016) @wence-
  • make conda installs in CI stricter (#17013) @jameslamb
  • Pylibcudf: pack and unpack (#17012) @madsbk
  • Remove unneeded pylibcudf.libcudf.wrappers.duration usage in cudf (#17010) @mroeschke
  • Add custom "fused" groupby aggregation to Dask cuDF (#17009) @rjzamora
  • Make tests more deterministic (#17008) @galipremsagar
  • Remove unused import (#17005) @Matt711
  • Add string.convert.convert_urls APIs to pylibcudf (#17003) @mroeschke
  • Add release tracking to project automation scripts (#17001) @jarmak-nv
  • Implement inequality joins by translation to conditional joins (#17000) @wence-
  • Add string.convert.convert_lists APIs to pylibcudf (#16997) @mroeschke
  • Performance optimization of JSON validation (#16996) @karthikeyann
  • Add string.convert.convert_ipv4 APIs to pylibcudf (#16994) @mroeschke
  • Add string.convert.convert_integers APIs to pylibcudf (#16991) @mroeschke
  • Add string.convert_floats APIs to pylibcudf (#16990) @mroeschke
  • Add string.convert.convert_fixed_type APIs to pylibcudf (#16984) @mroeschke
  • Remove unnecessary std::move's in pylibcudf (#16983) @Matt711
  • Add docstrings and test for strings.convert_durations APIs for pylibcudf (#16982) @mroeschke
  • JSON tokenizer memory optimizations (#16978) @shrshi
  • Turn on xfail_strict = true for all python packages (#16977) @wence-
  • Add string.convert.convert_datetime/convert_booleans APIs to pylibcudf (#16971) @mroeschke
  • Auto assign PR to author (#16969) @Matt711
  • Deprecate support for directly accessing logger (#16964) @vyasr
  • Expunge NamedColumn (#16962) @wence-
  • Add clang-tidy to CI (#16958) @vyasr
  • Address all remaining clang-tidy errors (#16956) @vyasr
  • Apply clang-tidy autofixes (#16949) @vyasr
  • Use nvcomp wheel instead of bundling nvcomp (#16946) @KyleFromNVIDIA
  • Refactor the cuda_memcpy functions to make them more usable (#16945) @vuule
  • Add string.split APIs to pylibcudf (#16940) @mroeschke
  • clang-tidy fixes part 3 (#16939) @vyasr
  • clang-tidy fixes part 2 (#16938) @vyasr
  • clang-tidy fixes part 1 (#16937) @vyasr
  • Add string.wrap APIs to pylibcudf (#16935) @mroeschke
  • Add string.translate APIs to pylibcudf (#16934) @mroeschke
  • Add string.find_multiple APIs to pylibcudf (#16920) @mroeschke
  • Batch memcpy the last offsets for output buffers of str and list cols in PQ reader (#16905) @mhaseeb123
  • reduce wheel build verbosity, narrow deprecation warning filter (#16896) @jameslamb
  • Improve aggregation device functors (#16884) @PointKernel
  • Upgrade pandas pinnings to support 2.2.3 (#16882) @galipremsagar
  • Fix 24.10 to 24.12 forward merge (#16876) @bdice
  • Manually resolve conflicts in between branch-24.12 and branch-24.10 (#16871) @galipremsagar
  • Add in support for setting delim when parsing JSON through java (#16867) @revans2
  • Reapply mixed_semi_join refactoring and bug fixes (#16859) @mhaseeb123
  • Add string padding and side_type APIs to pylibcudf (#16833) @mroeschke
  • Organize parquet reader mukernel non-nullable code, introduce manual block scans (#16830) @pmattione-nvidia
  • Remove superfluous use of std::vector for std::future (#16829) @kingcrimsontianyu
  • Rework read_csv IO to avoid reading whole input with a single host_read (#16826) @vuule
  • Add strings.combine APIs to pylibcudf (#16790) @mroeschke
  • Add remaining string.char_types APIs to pylibcudf (#16788) @mroeschke
  • Add new nvtext minhash_permuted API (#16756) @davidwendt
  • Avoid public constructors when called with columns to avoid unnecessary validation (#16747) @mroeschke
  • Use changed-files shared workflow (#16713) @KyleFromNVIDIA
  • lint: replace isort with Ruff's rule I (#16685) @Borda
  • Improve the performance of low cardinality groupby (#16619) @PointKernel
  • Parquet reader list microkernel (#16538) @pmattione-nvidia
  • AWS S3 IO through KvikIO (#16499) @madsbk
  • Refactor histogram reduction using cuco::static_set::insert_and_find (#16485) @srinivasyadav18
  • Use numba-cuda>=0.0.13 (#16474) @gmarkall

- C++
Published by GPUtester about 1 year ago

https://github.com/rapidsai/cudf - v24.10.01

This hotfix corrected some python packaging issues.

Full Changelog: https://github.com/rapidsai/cudf/compare/v24.10.00...v24.10.01

- C++
Published by raydouglass over 1 year ago

https://github.com/rapidsai/cudf - v24.10.00

🚨 Breaking Changes

  • Whitespace normalization of nested column coerced as string column in JSONL inputs (#16759) @shrshi
  • Add libcudf wrappers around current_device_resource functions. (#16679) @harrism
  • Fix empty cluster handling in tdigest merge (#16675) @jihoonson
  • Remove java ColumnView.copyWithBooleanColumnAsValidity (#16660) @revans2
  • Support reading multiple PQ sources with mismatching nullability for columns (#16639) @mhaseeb123
  • Remove arrow_io_source (#16607) @vyasr
  • Remove legacy Arrow interop APIs (#16590) @vyasr
  • Remove NativeFile support from cudf Python (#16589) @vyasr
  • Revert "Make proxy NumPy arrays pass isinstance check in cudf.pandas" (#16586) @Matt711
  • Align public utility function signatures with pandas 2.x (#16565) @mroeschke
  • Disallow cudf.Index accepting column in favor of ._from_column (#16549) @mroeschke
  • Refactor dictionary encoding in PQ writer to migrate to the new cuco::static_map (#16541) @mhaseeb123
  • Change IPv4 convert APIs to support UINT32 instead of INT64 (#16489) @davidwendt
  • enable list to be forced as string in JSON reader. (#16472) @karthikeyann
  • Disallow cudf.Series to accept column in favor of ._from_column (#16454) @mroeschke
  • Align groupby APIs with pandas 2.x (#16403) @mroeschke
  • Align misc DataFrame and MultiIndex methods with pandas 2.x (#16402) @mroeschke
  • Align Index APIs with pandas 2.x (#16361) @mroeschke
  • Add stream param to stream compaction APIs (#16295) @JayjeetAtGithub

🐛 Bug Fixes

  • Add license to the pylibcudf wheel (#16976) @raydouglass
  • Parse newline as whitespace character while tokenizing JSONL inputs with non-newline delimiter (#16950) @shrshi
  • Add dask-cudf workaround for missing rename_axis support in cudf (#16899) @rjzamora
  • Update oldest deps for pyarrow & numpy (#16883) @galipremsagar
  • Update labeler for pylibcudf (#16868) @vyasr
  • Revert "Refactor mixed_semi_join using cuco::static_set" (#16855) @mhaseeb123
  • Fix metadata after implicit array conversion from Dask cuDF (#16842) @rjzamora
  • Add cudf.pandas dependencies.yaml to update-version.sh (#16840) @raydouglass
  • Use cupy 12.2.0 as oldest dependency pinning on CUDA 12 ARM (#16808) @bdice
  • Revert "Fix empty cluster handling in tdigest merge (#16675)" (#16800) @jihoonson
  • Intentionally leak thread_local CUDA resources to avoid crash (part 1) (#16787) @kingcrimsontianyu
  • Fix cov/corr bug in dask-cudf (#16786) @rjzamora
  • Fix slice_strings wide strings logic with multi-byte characters (#16777) @davidwendt
  • Fix nvbench output for sha512 (#16773) @davidwendt
  • Allow read_csv(header=None) to return int column labels in `mode.pandas_compatible` (#16769) @mroeschke
  • Whitespace normalization of nested column coerced as string column in JSONL inputs (#16759) @shrshi
  • Fix DataFrame.drop(columns=cudf.Series/Index, axis=1) (#16712) @mroeschke
  • Use merge base when calculating changed files (#16709) @KyleFromNVIDIA
  • Ensure we pass the has_nulls tparam to mixed_join kernels (#16708) @abellina
  • Add boost-devel to Java CI Docker image (#16707) @jlowe
  • [BUG] Add gpu node type to cudf-pandas 3rd-party integration nightly CI job (#16704) @Matt711
  • Fix typo in column_factories.hpp comment from 'depth 1' to 'depth 2' (#16700) @a-hirota
  • Fix Series.to_frame(name=None) setting a None name (#16698) @mroeschke
  • Disable gtests/ERROR_TEST during compute-sanitizer memcheck test (#16691) @davidwendt
  • Enable batched multi-source reading of JSONL files with large records (#16687) @shrshi
  • Handle ordered parameter in CategoricalIndex.__repr__ (#16683) @galipremsagar
  • Fix loc/iloc.setitem[:, loc] with non cupy types (#16677) @mroeschke
  • Fix empty cluster handling in tdigest merge (#16675) @jihoonson
  • Fix cudf::rank not getting enough params (#16666) @JayjeetAtGithub
  • Fix slowdown in CategoricalIndex.__repr__ (#16665) @galipremsagar
  • Remove java ColumnView.copyWithBooleanColumnAsValidity (#16660) @revans2
  • Fix slowdown in DataFrame repr in jupyter notebook (#16656) @galipremsagar
  • Preserve Series name in duplicated method. (#16655) @bdice
  • Fix interval_range right child non-zero offset (#16651) @mroeschke
  • fix libcudf wheel publishing, make package-type explicit in wheel publishing (#16650) @jameslamb
  • Revert "Hide all gtest symbols in cudftestutil (#16546)" (#16644) @robertmaynard
  • Fix integer overflow in indexalator pointer logic (#16643) @davidwendt
  • Allow for binops between two differently sized DecimalDtypes (#16638) @mroeschke
  • Move pragma once in rolling/jit/operation.hpp. (#16636) @bdice
  • Fix overflow bug in low-memory JSON reader (#16632) @shrshi
  • Add the missing num_aggregations axis for groupby_max_cardinality (#16630) @PointKernel
  • Fix strings::detail::copy_range when target contains nulls (#16626) @davidwendt
  • Fix function parameters with common dependency modified during their evaluation (#16620) @ttnghia
  • bug-fix: Don't enable the CUDA language if testing was requested when finding cudf (#16615) @cryos
  • bug-fix: cudf/io/json.hpp use after move (#16609) @NicolasDenoyelle
  • Remove CUDA whole compilation ODR violations (#16603) @robertmaynard
  • MAINT: Adapt to numpy hiding flagsobject away (#16593) @seberg
  • Revert "Make proxy NumPy arrays pass isinstance check in cudf.pandas" (#16586) @Matt711
  • Switch python version to 3.10 in cudf.pandas pandas test scripts (#16559) @galipremsagar
  • Hide all gtest symbols in cudftestutil (#16546) @robertmaynard
  • Update the java code to properly deal with lists being returned as strings (#16536) @revans2
  • Register read_parquet and read_csv with dask-expr (#16535) @rjzamora
  • Change cudf::empty_like to not include offsets for empty strings columns (#16529) @davidwendt
  • Fix DataFrame reductions with median returning scalar instead of Series (#16527) @mroeschke
  • Allow DataFrame.sort_values(by=) to select an index level (#16519) @mroeschke
  • Fix date_range(start, end, freq) when end-start is divisible by freq (#16516) @mroeschke
  • Preserve array name in MultiIndex.from_arrays (#16515) @mroeschke
  • Disallow indexing by selecting duplicate labels (#16514) @mroeschke
  • Fix .replace(Index, Index) raising a TypeError (#16513) @mroeschke
  • Check index bounds in compact protocol reader. (#16493) @bdice
  • Fix build failures with GCC 13 (#16488) @PointKernel
  • Fix all-empty input column for strings split APIs (#16466) @davidwendt
  • Fix segmented-sort overlapped input/output indices (#16463) @davidwendt
  • Fix merge conflict for auto merge 16447 (#16449) @davidwendt

📖 Documentation

  • Fix links in Dask cuDF documentation (#16929) @rjzamora
  • Improve aggregation documentation (#16822) @PointKernel
  • Add best practices page to Dask cuDF docs (#16821) @rjzamora
  • [DOC] Update Pylibcudf doc strings (#16810) @Matt711
  • Recommending miniforge for conda install (#16782) @mmccarty
  • Add labeling pylibcudf doc pages (#16779) @mroeschke
  • Migrate dask-cudf README improvements to dask-cudf sphinx docs (#16765) @rjzamora
  • [DOC] Remove out of date section from cudf.pandas docs (#16697) @Matt711
  • Add performance tips to cudf.pandas FAQ. (#16693) @bdice
  • Update documentation for Dask cuDF (#16671) @rjzamora
  • Add missing pylibcudf strings docs (#16471) @brandon-b-miller
  • DOC: Refresh pylibcudf guide (#15856) @lithomas1

🚀 New Features

  • Build cudf-polars with build.sh (#16898) @brandon-b-miller
  • Add polars to "all" dependency list. (#16875) @bdice
  • nvCOMP GZIP integration (#16770) @vuule
  • [FEA] Add support for cudf.NamedAgg (#16744) @Matt711
  • Add experimental filesystem="arrow" support in dask_cudf.read_parquet (#16684) @rjzamora
  • Relax Arrow pin (#16681) @vyasr
  • Add libcudf wrappers around current_device_resource functions. (#16679) @harrism
  • Move NDS-H examples into benchmarks (#16663) @JayjeetAtGithub
  • [FEA] Add third-party library integration testing of cudf.pandas to cudf (#16645) @Matt711
  • Make isinstance check pass for proxy ndarrays (#16601) @Matt711
  • [FEA] Add an environment variable to fail on fallback in cudf.pandas (#16562) @Matt711
  • [FEA] Add support for cudf.unique (#16554) @Matt711
  • [FEA] Support named aggregations in df.groupby().agg() (#16528) @Matt711
  • Change IPv4 convert APIs to support UINT32 instead of INT64 (#16489) @davidwendt
  • enable list to be forced as string in JSON reader. (#16472) @karthikeyann
  • Remove cuDF dependency from pylibcudf column from_device tests (#16441) @brandon-b-miller
  • Enable cudf.pandas REPL and -c command support (#16428) @bdice
  • Setup pylibcudf package (#16299) @lithomas1
  • Add a libcudf/thrust-based TPC-H derived datagen (#16294) @JayjeetAtGithub
  • Make proxy NumPy arrays pass isinstance check in cudf.pandas (#16286) @Matt711
  • Add skiprows and nrows to parquet reader (#16214) @lithomas1
  • Upgrade to nvcomp 4.0.1 (#16076) @vuule
  • Migrate ORC reader to pylibcudf (#16042) @lithomas1
  • JSON reader validation of values (#15968) @karthikeyann
  • Implement exposed null mask APIs in pylibcudf (#15908) @charlesbluca
  • Word-based nvtext::minhash function (#15368) @davidwendt

🛠️ Improvements

  • Make tests deterministic (#16910) @galipremsagar
  • Update update-version.sh to use packaging lib (#16891) @AyodeAwe
  • Pin polars for 24.10 and update polars test suite xfail list (#16886) @wence-
  • Add in support for setting delim when parsing JSON through java (#16867) (#16880) @revans2
  • Remove unnecessary flag from build.sh (#16879) @vyasr
  • Ignore numba warning specific to ARM runners (#16872) @galipremsagar
  • Display deltas for cudf.pandas test summary (#16864) @galipremsagar
  • Switch to using native traceback (#16851) @galipremsagar
  • JSON tree algorithm code reorg (#16836) @karthikeyann
  • Add string.repeats API to pylibcudf (#16834) @mroeschke
  • Use CI workflow branch 'branch-24.10' again (#16832) @jameslamb
  • Rename the NDS-H benchmark binaries (#16831) @JayjeetAtGithub
  • Add string.findall APIs to pylibcudf (#16825) @mroeschke
  • Add string.extract APIs to pylibcudf (#16823) @mroeschke
  • use get-pr-info from nv-gha-runners (#16819) @AyodeAwe
  • Add string.contains APIs to pylibcudf (#16814) @mroeschke
  • Forward-merge branch-24.08 to branch-24.10 (#16813) @bdice
  • Add iotype axis with default `PINNEDBUFFER` to nvbench PQ multithreaded reader (#16809) @mhaseeb123
  • Update fmt (to 11.0.2) and spdlog (to 1.14.1). (#16806) @jameslamb
  • Add ability to set parquet row group max #rows and #bytes in java (#16805) @pmattione-nvidia
  • Add in option for Java JSON APIs to do column pruning in CUDF (#16796) @revans2
  • Support drop_first in get_dummies (#16795) @mroeschke
  • Exposed stream-ordering to join API (#16793) @lamarrr
  • Add string.attributes APIs to pylibcudf (#16785) @mroeschke
  • Java: Make ColumnVector.fromViewWithContiguousAllocation public (#16784) @jlowe
  • Add partitioning APIs to pylibcudf (#16781) @mroeschke
  • Optimization of tdigest merge aggregation. (#16780) @nvdbaranec
  • use libkvikio wheels in wheel builds (#16778) @jameslamb
  • Exposed stream-ordering to datetime API (#16774) @lamarrr
  • Add io/timezone APIs to pylibcudf (#16771) @mroeschke
  • Remove MultiIndex._poplevel inplace implementation. (#16767) @mroeschke
  • allow pandas patch version to float in cudf-pandas unit tests (#16763) @jameslamb
  • Simplify the nvCOMP adapter (#16762) @vuule
  • Add labeling APIs to pylibcudf (#16761) @mroeschke
  • Add transform APIs to pylibcudf (#16760) @mroeschke
  • Add a benchmark to study Parquet reader's performance for wide tables (#16751) @mhaseeb123
  • Change the Parquet writer's default_row_group_size_bytes from 128MB to inf (#16750) @mhaseeb123
  • Add transpose API to pylibcudf (#16749) @mroeschke
  • Add support for Python 3.12, update Kafka dependencies to 2.5.x (#16745) @jameslamb
  • Generate GPU vs CPU usage metrics per pytest file in pandas testsuite for cudf.pandas (#16739) @galipremsagar
  • Refactor cudf pandas integration tests CI (#16728) @Matt711
  • Remove ERROR_TEST gtest from libcudf (#16722) @davidwendt
  • Use Series._from_column more consistently to avoid validation (#16716) @mroeschke
  • remove some unnecessary libcudf nightly builds (#16714) @jameslamb
  • Remove xfail from torch-cudf.pandas integration test (#16705) @Matt711
  • Add return type annotations to MultiIndex (#16696) @mroeschke
  • Add type annotations to Index classes, utilize _from_column more (#16695) @mroeschke
  • Have interval_range use IntervalIndex.from_breaks, remove column_empty_same_mask (#16694) @mroeschke
  • Increase timeouts for couple of tests (#16692) @galipremsagar
  • Replace raw device_memory_resource pointer in pylibcudf Cython (#16674) @harrism
  • switch from typing.Callable to collections.abc.Callable (#16670) @jameslamb
  • Update rapidsai/pre-commit-hooks (#16669) @KyleFromNVIDIA
  • Multi-file and Parquet-aware prefetching from remote storage (#16657) @rjzamora
  • Access Frame attributes instead of ColumnAccessor attributes when available (#16652) @mroeschke
  • Use non-mangled type names in nvbench output (#16649) @davidwendt
  • Add pylibcudf build dir in build.sh for clean (#16648) @galipremsagar
  • Prune workflows based on changed files (#16642) @KyleFromNVIDIA
  • Remove arrow dependency (#16640) @vyasr
  • Support reading multiple PQ sources with mismatching nullability for columns (#16639) @mhaseeb123
  • Drop Python 3.9 support (#16637) @jameslamb
  • Support DecimalDtype meta in dask_cudf (#16634) @mroeschke
  • Add num_multiprocessors utility (#16628) @PointKernel
  • Annotate ColumnAccessor._data labels as Hashable (#16623) @mroeschke
  • Remove build_categorical_column in favor of CategoricalColumn constructor (#16617) @mroeschke
  • Move apply_boolean_mask benchmark to nvbench (#16616) @davidwendt
  • Revise get_reader_filepath_or_buffer to handle a list of data sources (#16613) @rjzamora
  • do not install cudf in cudf_polars wheel tests (#16612) @jameslamb
  • remove streamz git dependency, standardize build dependency names, consolidate some dependency lists (#16611) @jameslamb
  • Fix C++ and Cython io types (#16610) @vyasr
  • Remove arrow_io_source (#16607) @vyasr
  • Remove thrust::optional from expression evaluator (#16604) @bdice
  • Add stricter typing and validation to ColumnAccessor (#16602) @mroeschke
  • make more use of YAML anchors in dependencies.yaml (#16597) @jameslamb
  • Enable testing cudf.pandas unit tests for all minor versions of pandas (#16595) @galipremsagar
  • Extend the Parquet writer's dictionary encoding benchmark. (#16591) @mhaseeb123
  • Remove legacy Arrow interop APIs (#16590) @vyasr
  • Remove NativeFile support from cudf Python (#16589) @vyasr
  • Add build job for pylibcudf (#16587) @vyasr
  • Add public qualifier for some member functions in Java class Schema (#16583) @ttnghia
  • Enable gtests previously disabled for compute-sanitizer bug (#16581) @davidwendt
  • [FEA] Add filesystem argument to cudf.read_parquet (#16577) @rjzamora
  • Ensure size is always passed to NumericalColumn (#16576) @mroeschke
  • standardize and consolidate wheel installations in testing scripts (#16575) @jameslamb
  • Performance improvement for strings::slice for wide strings (#16574) @davidwendt
  • Add ToCudfBackend expression to dask-cudf (#16573) @rjzamora
  • CI: Test against old versions of key dependencies (#16570) @seberg
  • Replace NativeFile dependency in dask-cudf Parquet reader (#16569) @rjzamora
  • Align public utility function signatures with pandas 2.x (#16565) @mroeschke
  • Move libcudf reduction google-benchmarks to nvbench (#16564) @davidwendt
  • Rework strings::slice benchmark to use nvbench (#16563) @davidwendt
  • Reenable arrow tests (#16556) @vyasr
  • Clean up reshaping ops (#16553) @mroeschke
  • Disallow cudf.Index accepting column in favor of ._from_column (#16549) @mroeschke
  • Rewrite remaining Python Arrow interop conversions using the C Data Interface (#16548) @vyasr
  • [REVIEW] JSON host tree algorithms (#16545) @shrshi
  • Refactor dictionary encoding in PQ writer to migrate to the new cuco::static_map (#16541) @mhaseeb123
  • Remove hardcoded versions from workflows. (#16540) @bdice
  • Ensure comparisons with pyints and integer series always succeed (#16532) @seberg
  • Remove unneeded output size parameter from internal count_matches utility (#16531) @davidwendt
  • Remove invalid column_view usage in string-scalar-to-column function (#16530) @davidwendt
  • Raise NotImplementedError for Series.rename that's not a scalar (#16525) @mroeschke
  • Remove deprecated public APIs from libcudf (#16524) @davidwendt
  • Return Interval object in pandas compat mode for IntervalIndex reductions (#16523) @mroeschke
  • Update json normalization to take device_buffer (#16520) @karthikeyann
  • Rework cudf::io::text::byte_range_info class member functions (#16518) @davidwendt
  • Remove unneeded pair-iterator benchmark (#16511) @davidwendt
  • Update pre-commit hooks (#16510) @KyleFromNVIDIA
  • Improve update-version.sh (#16506) @bdice
  • Use tool.scikit-build.cmake.version, set scikit-build-core minimum-version (#16503) @jameslamb
  • Pass batch size to JSON reader using environment variable (#16502) @shrshi
  • Remove a deprecated multibyte_split API (#16501) @davidwendt
  • Add interop example for arrow::StringViewArray to cudf::column (#16498) @JayjeetAtGithub
  • Add keep option to distinct nvbench (#16497) @bdice
  • Use more idiomatic cudf APIs in dask_cudf meta generation (#16487) @mroeschke
  • Fix typo in dispatch_row_equal. (#16473) @bdice
  • Use explicit construction of column subclass instead of build_column when type is known (#16470) @mroeschke
  • Move exception handler into pylibcudf from cudf (#16468) @lithomas1
  • Make StructColumn.__init__ strict (#16467) @mroeschke
  • Make ListColumn.__init__ strict (#16465) @mroeschke
  • Make Timedelta/DatetimeColumn.__init__ strict (#16464) @mroeschke
  • Make NumericalColumn.__init__ strict (#16457) @mroeschke
  • Make CategoricalColumn.__init__ strict (#16456) @mroeschke
  • Disallow cudf.Series to accept column in favor of ._from_column (#16454) @mroeschke
  • Expose stream param in transform APIs (#16452) @JayjeetAtGithub
  • Add upper bound pin for polars (#16442) @wence-
  • Make (Indexed)Frame.__init__ require data (and index) (#16430) @mroeschke
  • Add Java APIs to copy column data to host asynchronously (#16429) @jlowe
  • Update docs of the TPC-H derived examples (#16423) @JayjeetAtGithub
  • Use RMM adaptor constructors instead of factories. (#16414) @bdice
  • Align ewm APIs with pandas 2.x (#16413) @mroeschke
  • Remove checking for specific tests in memcheck script (#16412) @davidwendt
  • Add stream parameter to reshape APIs (#16410) @davidwendt
  • Align groupby APIs with pandas 2.x (#16403) @mroeschke
  • Align misc DataFrame and MultiIndex methods with pandas 2.x (#16402) @mroeschke
  • update some branch references in GitHub Actions configs (#16397) @jameslamb
  • Support reading matching projected and filter cols from Parquet files with otherwise mismatched schemas (#16394) @mhaseeb123
  • Merge branch-24.08 into branch-24.10 (#16393) @jameslamb
  • Add query 10 to the TPC-H suite (#16392) @JayjeetAtGithub
  • Use make_host_vector instead of make_std_vector to facilitate pinned memory optimizations (#16386) @vuule
  • Fix some issues with deprecated / removed cccl facilities (#16377) @miscco
  • Align IntervalIndex APIs with pandas 2.x (#16371) @mroeschke
  • Align CategoricalIndex APIs with pandas 2.x (#16369) @mroeschke
  • Align TimedeltaIndex APIs with pandas 2.x (#16368) @mroeschke
  • Align DatetimeIndex APIs with pandas 2.x (#16367) @mroeschke
  • fix [tool.setuptools] reference in custreamz config (#16365) @jameslamb
  • Align Index APIs with pandas 2.x (#16361) @mroeschke
  • Rebuild for & Support NumPy 2 (#16300) @jakirkham
  • Add stream param to stream compaction APIs (#16295) @JayjeetAtGithub
  • Added batch memset to memset data and validity buffers in parquet reader (#16281) @sdrp713
  • Deduplicate decimal32/decimal64 to decimal128 conversion function (#16236) @mhaseeb123
  • Refactor mixed_semi_join using cuco::static_set (#16230) @srinivasyadav18
  • Improve performance of hash_character_ngrams using warp-per-string kernel (#16212) @davidwendt
  • Add environment variable to log cudf.pandas fallback calls (#16161) @mroeschke
  • Add libcudf example with large strings (#15983) @davidwendt
  • JSON tree algorithms refactor I: CSR data structure for column tree (#15979) @shrshi
  • Support multiple new-line characters in regex APIs (#15961) @davidwendt
  • adding wheel build for libcudf (#15483) @msarahan
  • Replace usages of thrust::optional with std::optional (#15091) @miscco

- C++
Published by raydouglass over 1 year ago

https://github.com/rapidsai/cudf - [NIGHTLY] v24.12.00

🔗 Links

🚨 Breaking Changes

  • Move detail header floating_conversion.hpp to detail subdirectory (#17209) @davidwendt
  • Refactor Dask cuDF legacy code (#17205) @rjzamora
  • Make HostMemoryBuffer call into the DefaultHostMemoryAllocator (#17204) @revans2
  • Remove java reservation (#17189) @revans2
  • Upgrade to polars 1.11 in cudf-polars (#17154) @wence-
  • Remove the additional host register calls initially intended for performance improvement on Grace Hopper (#17092) @kingcrimsontianyu
  • Correctly set is_device_accessible when creating host_spans from other container/span types (#17079) @vuule
  • Unify treatment of Expr and IR nodes in cudf-polars DSL (#17016) @wence-
  • Deprecate support for directly accessing logger (#16964) @vyasr
  • Made cudftestutil header-only and removed GTest dependency (#16839) @lamarrr

🐛 Bug Fixes

  • Fix binop with LHS numpy datetimelike scalar (#17226) @mroeschke
  • Fix groupby.get_group with length-1 tuple with list-like grouper (#17216) @mroeschke
  • Fix discoverability of submodules inside pd.util (#17215) @galipremsagar
  • Fix Schema.Builder does not propagate precision value to Builder instance (#17214) @ttnghia
  • [BUG] Replace repo_token with github_token in Auto Assign PR GHA (#17203) @Matt711
  • Remove unsanitized nulls from input strings columns in reduction gtests (#17202) @davidwendt
  • Fix to_parquet append behavior with global metadata file (#17198) @rjzamora
  • Check num_children() == 0 in Column.from_column_view (#17193) @cwharris
  • Fix host-to-device copy missing sync in strings/duration convert (#17149) @davidwendt
  • Add JNI Support for Multi-line Delimiters and Include Test (#17139) @SurajAralihalli
  • Ignore loud dask warnings about legacy dataframe implementation (#17137) @galipremsagar
  • Fix the GDS read/write segfault/bus error when the cuFile policy is set to GDS or ALWAYS (#17122) @kingcrimsontianyu
  • Fix DataFrame._from_arrays and introduce validations (#17112) @galipremsagar
  • [Bug] Fix Arrow-FS parquet reader for larger files (#17099) @rjzamora
  • Fix bug in recovering invalid lines in JSONL inputs (#17098) @shrshi
  • Reenable huge pages for arrow host copying (#17097) @vyasr
  • Correctly set is_device_accessible when creating host_spans from other container/span types (#17079) @vuule
  • Fix ORC reader when using device_read_async while the destination device buffers are not ready (#17074) @ttnghia
  • Fix regex handling of fixed quantifier with 0 range (#17067) @davidwendt
  • Limit the number of keys to calculate column sizes and page starts in PQ reader to 1B (#17059) @mhaseeb123
  • Adding assertion to check for regular JSON inputs of size greater than INT_MAX bytes (#17057) @shrshi
  • bug fix: use self.ck_consumer in poll method of kafka.py to align with __init__ (#17044) @a-hirota
  • Disable kvikio remote I/O to avoid openssl dependencies in JNI build (#17026) @pxLi
  • Fix host_span constructor to correctly copy is_device_accessible (#17020) @vuule
  • Add pinning for pyarrow in wheels (#17018) @vyasr
  • Use std::optional for host types (#17015) @robertmaynard
  • Fix write_json to handle empty string column (#16995) @karthikeyann
  • Restore export of nvcomp outside of wheel builds (#16988) @KyleFromNVIDIA
  • Allow melt(var_name=) to be a falsy label (#16981) @mroeschke
  • Fix astype from tz-aware type to tz-aware type (#16980) @mroeschke
  • Use libcudf wheel from PR rather than nightly for polars-polars CI test job (#16975) @brandon-b-miller
  • Fix order-preservation in pandas-compat unsorted groupby (#16942) @wence-
  • Fix cudf::strings::findall error with empty input (#16928) @davidwendt
  • Fix JsonLargeReaderTest.MultiBatch use of LIBCUDF_JSON_BATCH_SIZE env var (#16927) @davidwendt
  • Parse newline as whitespace character while tokenizing JSONL inputs with non-newline delimiter (#16923) @shrshi
  • Respect groupby.nunique(dropna=False) (#16921) @mroeschke
  • Update all rmm imports to use pylibrmm/librmm (#16913) @Matt711
  • Fix order-preservation in cudf-polars groupby (#16907) @wence-
  • Add a shortcut for when the input clusters are all empty for the tdigest merge (#16897) @jihoonson
  • Properly handle the mapped and registered regions in memory_mapped_source (#16865) @vuule
  • Fix performance regression for generate_character_ngrams (#16849) @davidwendt
  • Fix regex parsing logic handling of nested quantifiers (#16798) @davidwendt
  • Compute whole column variance using numerically stable approach (#16448) @wence-

📖 Documentation

  • Fix some documentation rendering for pylibcudf (#17217) @mroeschke
  • Move detail header floating_conversion.hpp to detail subdirectory (#17209) @davidwendt
  • Add TokenizeVocabulary to api docs (#17208) @davidwendt
  • Add jaccard_index to generated cuDF docs (#17199) @davidwendt
  • [no ci] Add empty-columns section to the libcudf developer guide (#17183) @davidwendt
  • Add 2-cpp approvers text to contributing guide no ci @davidwendt
  • Changing developer guide int64t to int64_t (#17130) @hyperbolic2346
  • docs: change 'CSV' to 'csv' in python/custreamz/README.md to match kafka.py (#17041) @a-hirota
  • [DOC] Document limitation using cudf.pandas proxy arrays (#16955) @Matt711
  • [DOC] Document environment variable for failing on fallback in cudf.pandas (#16932) @Matt711

🚀 New Features

  • Upgrade nvcomp to 4.1.0.6 (#17201) @bdice
  • Support storing precision of decimal types in Schema class (#17176) @ttnghia
  • Add compute_shared_memory_aggs used by shared memory groupby (#17162) @PointKernel
  • Add compute_mapping_indices used by shared memory groupby (#17147) @PointKernel
  • Add remaining datetime APIs to pylibcudf (#17143) @Matt711
  • Added strings AST vs BINARY_OP benchmarks (#17128) @lamarrr
  • Include timezone file path in error message (#17102) @bdice
  • Migrate NVText Byte Pair Encoding APIs to pylibcudf (#17101) @Matt711
  • Migrate NVText Tokenizing APIs to pylibcudf (#17100) @Matt711
  • Migrate NVtext subword tokenizing APIs to pylibcudf (#17096) @Matt711
  • Migrate NVText Stemming APIs to pylibcudf (#17085) @Matt711
  • Migrate NVText Replacing APIs to pylibcudf (#17084) @Matt711
  • Migrate NVText Normalizing APIs to Pylibcudf (#17072) @Matt711
  • Migrate remaining nvtext NGrams APIs to pylibcudf (#17070) @Matt711
  • Add profilers to CUDA 12 conda devcontainers (#17066) @vyasr
  • Add conda recipe for cudf-polars (#17037) @bdice
  • Implement batch construction for strings columns (#17035) @ttnghia
  • Add device aggregators used by shared memory groupby (#17031) @PointKernel
  • Migrate Min Hashing APIs to pylibcudf (#17021) @Matt711
  • Reorganize cudf_polars expression code (#17014) @brandon-b-miller
  • Migrate nvtext jaccard API to pylibcudf (#17007) @Matt711
  • Migrate nvtext generate_ngrams APIs to pylibcudf (#17006) @Matt711
  • Control whether a file data source memory-maps the file with an environment variable (#17004) @vuule
  • Switched BINARY_OP Benchmarks from GoogleBench to NVBench (#16963) @lamarrr
  • [FEA] Migrate nvtext/edit_distance APIs to pylibcudf (#16957) @Matt711
  • Switched AST benchmarks from GoogleBench to NVBench (#16952) @lamarrr
  • Extend device_scalar to optionally use pinned bounce buffer (#16947) @vuule
  • Expose streams in public round APIs (#16925) @Matt711
  • Made cudftestutil header-only and removed GTest dependency (#16839) @lamarrr
  • Add an example to demonstrate multithreaded read_parquet pipelines (#16828) @mhaseeb123
  • Implement extract_datetime_component in libcudf/pylibcudf (#16776) @brandon-b-miller
  • Add cudf::strings::find_re API (#16742) @davidwendt
  • Migrate hashing operations to pylibcudf (#15418) @brandon-b-miller

🛠️ Improvements

  • Use more pylibcudf.io.types enums in cudf._libs (#17237) @mroeschke
  • Expose mixed and conditional joins in pylibcudf (#17235) @wence-
  • Add num_iterations axis to the multi-threaded Parquet benchmarks (#17231) @vuule
  • Support for polars 1.12 in cudf-polars (#17227) @wence-
  • Remove nvtext::load_vocabulary from pylibcudf (#17220) @Matt711
  • Expose stream-ordering in partitioning API (#17213) @shrshi
  • Move strings::concatenate benchmark to nvbench (#17211) @davidwendt
  • Expose stream-ordering in subword tokenizer API (#17206) @shrshi
  • Refactor Dask cuDF legacy code (#17205) @rjzamora
  • Make HostMemoryBuffer call into the DefaultHostMemoryAllocator (#17204) @revans2
  • Unified binary_ops and ast benchmarks parameter names (#17200) @lamarrr
  • Add in new java API for raw host memory allocation (#17197) @revans2
  • Remove java reservation (#17189) @revans2
  • Fixed unused attribute compilation error for GCC 13 (#17188) @lamarrr
  • Change default KvikIO parameters in cuDF: set the thread pool size to 4, and compatibility mode to ON (#17185) @kingcrimsontianyu
  • Use make_device_uvector instead of cudaMemcpyAsync in inplace_bitmask_binop (#17181) @davidwendt
  • Make ai.rapids.cudf.HostMemoryBuffer#copyFromStream public. (#17179) @liurenjie1024
  • Move nvtext ngrams benchmarks to nvbench (#17173) @davidwendt
  • Remove includes suggested by include-what-you-use (#17170) @vyasr
  • Upgrade to polars 1.11 in cudf-polars (#17154) @wence-
  • Deprecate current libcudf nvtext minhash functions (#17152) @davidwendt
  • Remove unused variable in internal merge_tdigests utility (#17151) @davidwendt
  • Use the full ref name of rmm.DeviceBuffer in the sphinx config file (#17150) @Matt711
  • Move segmented_gather function from the copying module to the lists module (#17148) @Matt711
  • Use async execution policy for true_if (#17146) @PointKernel
  • Add conversion from cudf-polars expressions to libcudf ast for parquet filters (#17141) @wence-
  • devcontainer: replace VAULT_HOST with AWS_ROLE_ARN (#17134) @jjacobelli
  • Replace direct cudaMemcpyAsync calls with utility functions (limited to cudf::io) (#17132) @vuule
  • use rapids-generate-pip-constraints to pin to oldest dependencies in CI (#17131) @jameslamb
  • Set the default number of threads in KvikIO thread pool to 8 (#17126) @kingcrimsontianyu
  • Fix clang-tidy violations for span.hpp and hostdevice_vector.hpp (#17124) @davidwendt
  • Disable the Parquet reader's wide lists tables GTest by default (#17120) @mhaseeb123
  • Add compile time check to ensure the counting_iterator type in counting_transform_iterator fits in size_type (#17118) @mhaseeb123
  • Minor I/O code quality improvements (#17105) @kingcrimsontianyu
  • Remove the additional host register calls initially intended for performance improvement on Grace Hopper (#17092) @kingcrimsontianyu
  • Split hash-based groupby into multiple smaller files to reduce build time (#17089) @PointKernel
  • build wheels without build isolation (#17088) @jameslamb
  • Remove unused hash helper functions (#17056) @PointKernel
  • Add to_dlpack/from_dlpack APIs to pylibcudf (#17055) @mroeschke
  • Move flatten_single_pass_aggs to its own TU (#17053) @PointKernel
  • Replace deprecated cuco APIs with updated versions (#17052) @PointKernel
  • Refactor ORC dictionary encoding to migrate to the new cuco::static_map (#17049) @mhaseeb123
  • Move pylibcudf/libcudf/wrappers/decimals to pylibcudf/libcudf/fixed_point (#17048) @mroeschke
  • make conda installs in CI stricter (part 2) (#17042) @jameslamb
  • Use managed memory for NDSH benchmarks (#17039) @karthikeyann
  • Clean up hash-groupby var_hash_functor (#17034) @PointKernel
  • Add json APIs to pylibcudf (#17025) @mroeschke
  • Add string.replace_re APIs to pylibcudf (#17023) @mroeschke
  • Replace old host tree algorithm with new algorithm in JSON reader (#17019) @karthikeyann
  • Unify treatment of Expr and IR nodes in cudf-polars DSL (#17016) @wence-
  • make conda installs in CI stricter (#17013) @jameslamb
  • Pylibcudf: pack and unpack (#17012) @madsbk
  • Remove unneeded pylibcudf.libcudf.wrappers.duration usage in cudf (#17010) @mroeschke
  • Add custom "fused" groupby aggregation to Dask cuDF (#17009) @rjzamora
  • Make tests more deterministic (#17008) @galipremsagar
  • Remove unused import (#17005) @Matt711
  • Add string.convert.convert_urls APIs to pylibcudf (#17003) @mroeschke
  • Add release tracking to project automation scripts (#17001) @jarmak-nv
  • Add string.convert.convert_lists APIs to pylibcudf (#16997) @mroeschke
  • Performance optimization of JSON validation (#16996) @karthikeyann
  • Add string.convert.convert_ipv4 APIs to pylibcudf (#16994) @mroeschke
  • Add string.convert.convert_integers APIs to pylibcudf (#16991) @mroeschke
  • Add string.convert_floats APIs to pylibcudf (#16990) @mroeschke
  • Add string.convert.convert_fixed_type APIs to pylibcudf (#16984) @mroeschke
  • Remove unnecessary std::move's in pylibcudf (#16983) @Matt711
  • Add docstrings and test for strings.convert_durations APIs for pylibcudf (#16982) @mroeschke
  • JSON tokenizer memory optimizations (#16978) @shrshi
  • Turn on xfail_strict = true for all python packages (#16977) @wence-
  • Add string.convert.convert_datetime/convert_booleans APIs to pylibcudf (#16971) @mroeschke
  • Auto assign PR to author (#16969) @Matt711
  • Deprecate support for directly accessing logger (#16964) @vyasr
  • Expunge NamedColumn (#16962) @wence-
  • Add clang-tidy to CI (#16958) @vyasr
  • Address all remaining clang-tidy errors (#16956) @vyasr
  • Apply clang-tidy autofixes (#16949) @vyasr
  • Use nvcomp wheel instead of bundling nvcomp (#16946) @KyleFromNVIDIA
  • Refactor the cuda_memcpy functions to make them more usable (#16945) @vuule
  • Add string.split APIs to pylibcudf (#16940) @mroeschke
  • clang-tidy fixes part 3 (#16939) @vyasr
  • clang-tidy fixes part 2 (#16938) @vyasr
  • clang-tidy fixes part 1 (#16937) @vyasr
  • Add string.wrap APIs to pylibcudf (#16935) @mroeschke
  • Add string.translate APIs to pylibcudf (#16934) @mroeschke
  • Add string.find_multiple APIs to pylibcudf (#16920) @mroeschke
  • Batch memcpy the last offsets for output buffers of str and list cols in PQ reader (#16905) @mhaseeb123
  • reduce wheel build verbosity, narrow deprecation warning filter (#16896) @jameslamb
  • Improve aggregation device functors (#16884) @PointKernel
  • Upgrade pandas pinnings to support 2.2.3 (#16882) @galipremsagar
  • Fix 24.10 to 24.12 forward merge (#16876) @bdice
  • Manually resolve conflicts in between branch-24.12 and branch-24.10 (#16871) @galipremsagar
  • Add in support for setting delim when parsing JSON through java (#16867) @revans2
  • Reapply mixed_semi_join refactoring and bug fixes (#16859) @mhaseeb123
  • Add string padding and side_type APIs to pylibcudf (#16833) @mroeschke
  • Organize parquet reader mukernel non-nullable code, introduce manual block scans (#16830) @pmattione-nvidia
  • Remove superfluous use of std::vector for std::future (#16829) @kingcrimsontianyu
  • Rework read_csv IO to avoid reading whole input with a single host_read (#16826) @vuule
  • Add strings.combine APIs to pylibcudf (#16790) @mroeschke
  • Add remaining string.char_types APIs to pylibcudf (#16788) @mroeschke
  • Avoid public constructors when called with columns to avoid unnecessary validation (#16747) @mroeschke
  • Use changed-files shared workflow (#16713) @KyleFromNVIDIA
  • lint: replace isort with Ruff's rule I (#16685) @Borda
  • Parquet reader list microkernel (#16538) @pmattione-nvidia
  • Refactor histogram reduction using cuco::static_set::insert_and_find (#16485) @srinivasyadav18
  • Use numba-cuda>=0.0.13 (#16474) @gmarkall

- C++
Published by rapids-bot[bot] over 1 year ago

https://github.com/rapidsai/cudf - [NIGHTLY] v24.08.00

🔗 Links

🚨 Breaking Changes

  • Align Index init APIs with pandas 2.x (#16362) @mroeschke
  • Align Series APIs with pandas 2.x (#16333) @mroeschke
  • Add missing stream param to dictionary factory APIs (#16319) @JayjeetAtGithub
  • Deprecate dtype= parameter in reduction methods (#16313) @mroeschke
  • Remove squeeze argument from groupby (#16312) @mroeschke
  • Align more DataFrame APIs with pandas (#16310) @mroeschke
  • Remove mr param from write_csv and write_json (#16231) @JayjeetAtGithub
  • Report number of rows per file read by PQ reader when no row selection and fix segfault in chunked PQ reader when skip_rows > 0 (#16195) @mhaseeb123
  • Refactor from_arrow_device/host to use resource_ref (#16160) @harrism
  • Deprecate Arrow support in I/O (#16132) @lithomas1
  • Return FrozenList for Index.names (#16047) @galipremsagar
  • Add compile option to enable large strings support (#16037) @davidwendt
  • Hide visibility of non public symbols (#15982) @robertmaynard
  • Rename strings multiple target replace API (#15898) @davidwendt
  • Pinned vector factory that uses the global pool (#15895) @vuule
  • Apply clang-tidy autofixes (#15894) @vyasr
  • Support arrow:schema in Parquet writer to faithfully roundtrip duration types with Arrow (#15875) @mhaseeb123
  • Expose stream parameter to public rolling APIs (#15865) @srinivasyadav18
  • Fix large strings handling in nvtext::character_tokenize (#15829) @davidwendt
  • Remove legacy JSON reader and concurrent_unordered_map.cuh. (#15813) @bdice

🐛 Bug Fixes

  • Ensure managed memory is supported in cudf.pandas. (#16552) @bdice
  • Add flatbuffers to libcudf build (#16446) @galipremsagar
  • Fix parquet_field_list read_func lambda capture invalid this pointer (#16440) @davidwendt
  • Enable prefetching in cudf.pandas.install() (#16439) @bdice
  • Enable prefetching before runpy (#16427) @galipremsagar
  • Support thread-safe for prefetch_config::get and prefetch_config::set (#16425) @ttnghia
  • Fix a pandas-2.0 missing attribute error (#16416) @galipremsagar
  • [Bug] Remove loud NativeFile deprecation noise for read_parquet from S3 (#16415) @rjzamora
  • Fix nightly memcheck error for empty STREAM_INTEROP_TEST (#16406) @davidwendt
  • Gate ArrowStringArrayNumpySemantics cudf.pandas proxy behind version check (#16401) @mroeschke
  • Don't export bs_thread_pool (#16398) @KyleFromNVIDIA
  • Require fixed width types for casting in cudf-polars (#16381) @brandon-b-miller
  • Fix docstring of DataFrame.apply (#16351) @galipremsagar
  • Make __bool__ raise for more cudf objects (#16311) @mroeschke
  • Rename .devcontainers for CUDA 12.5 (#16293) @jakirkham
  • Fix split_record for all empty strings column (#16291) @davidwendt
  • Fix logic in to_arrow for empty list column (#16279) @wence-
  • [BUG] Make name attr of Index fast slow attrs (#16270) @Matt711
  • Add custom name setter and getter for proxy objects in cudf.pandas (#16234) @Matt711
  • Fall back when casting a timestamp to numeric in cudf-polars (#16232) @brandon-b-miller
  • Disable large string support for Java build (#16216) @jlowe
  • Remove CCCL patch for PR 211. (#16207) @bdice
  • Add single offset to an empty ListArray in cudf::to_arrow (#16201) @davidwendt
  • Fix memory_usage when calculating nested list column (#16193) @mroeschke
  • Support at/iat indexers in cudf.pandas (#16177) @mroeschke
  • Fix unused-return-value debug build error in from_arrow_stream_test.cpp (#16168) @davidwendt
  • Fix cudf::strings::replace_multiple hang on empty target (#16167) @davidwendt
  • Refactor from_arrow_device/host to use resource_ref (#16160) @harrism
  • interpolate returns new column if no values are interpolated (#16158) @mroeschke
  • Use provided memory resource for allocating mixed join results. (#16153) @bdice
  • Run DFG after verify-alpha-spec (#16151) @KyleFromNVIDIA
  • Use size_t to allow large conditional joins (#16127) @bdice
  • Allow only scale=0 fixed-point values in fixed_width_column_wrapper (#16120) @davidwendt
  • Fix pylibcudf Table.num_rows for 0 columns case and add interop to docs (#16108) @lithomas1
  • Add support for proxy np.flatiter objects (#16107) @Matt711
  • Ensure cudf objects can astype to any type when empty (#16106) @mroeschke
  • Support pd.read_pickle and pd.to_pickle in cudf.pandas (#16105) @Matt711
  • Fix unnecessarily strict check in parquet chunked reader for choosing split locations. (#16099) @nvdbaranec
  • Fix is_monotonic_* APIs to include nan's (#16085) @galipremsagar
  • More safely parse CUDA versions when subprocess output is contaminated (#16067) @brandon-b-miller
  • fast_slow_proxy: Don't import assert_eq at top-level (#16063) @wence-
  • Prevent bad ColumnAccessor state after .sort_index(axis=1, ignore_index=True) (#16061) @mroeschke
  • Fix ArrowDeviceArray interface to pass address of event (#16058) @zeroshade
  • Fix a size overflow bug in hash groupby (#16053) @PointKernel
  • Fix atomic_ref scope when multiple blocks are updating the same output (#16051) @vuule
  • Fix initialization error in to_arrow for empty string views (#16033) @wence-
  • Fix the int32 overflow when computing page fragment sizes for large string columns (#16028) @mhaseeb123
  • Fix the pool size alignment issue (#16024) @PointKernel
  • Improve multibyte-split byte-range performance (#16019) @davidwendt
  • Fix target counting in strings char-parallel replace (#16017) @davidwendt
  • Support IntervalDtype in cudf.from_pandas (#16014) @mroeschke
  • Fix memory size in create_byte_range_infos_consecutive (#16012) @davidwendt
  • Hide visibility of non public symbols (#15982) @robertmaynard
  • Fix Cython typo preventing proper inheritance (#15978) @vyasr
  • Fix convert_dtypes with convert_integer=False/convert_floating=True (#15964) @mroeschke
  • Fix nunique for MultiIndex, DataFrame, and all NA case with dropna=False (#15962) @mroeschke
  • Explicitly build for all GPU architectures (#15959) @vyasr
  • Preserve column type and class information in more DataFrame operations (#15949) @mroeschke
  • Add __array_interface__ to cudf.pandas numpy.ndarray proxy (#15936) @mroeschke
  • Allow tests to be built when stream util is disabled (#15933) @robertmaynard
  • Fix JSON multi-source reading when total source size exceeds INT_MAX bytes (#15930) @shrshi
  • Fix dask_cudf.read_parquet regression for legacy timestamp data (#15929) @rjzamora
  • Fix offsetalator when accessing over 268 million rows (#15921) @davidwendt
  • Fix debug assert in rowgroup_char_counts_kernel (#15902) @davidwendt
  • Fix categorical conversion from chunked arrow arrays (#15886) @vyasr
  • Handling for NaN and inf when converting floating point to fixed point types (#15885) @ttnghia
  • Manual merge of Branch 24.08 from 24.06 (#15869) @galipremsagar
  • Avoid unnecessary Index cast in IndexedFrame.index setter (#15843) @charlesbluca
  • Fix large strings handling in nvtext::character_tokenize (#15829) @davidwendt
  • Fix multi-replace target count logic for large strings (#15807) @davidwendt
  • Fix JSON parsing memory corruption - Fix Mixed types nested children removal (#15798) @karthikeyann
  • Allow anonymous user in devcontainer name. (#15784) @bdice
  • Add support for additional metaclasses of proxies and use for ExcelWriter (#15399) @vyasr

📖 Documentation

  • Improve Polars docs (#16820) @bdice
  • Add docstring for from_dataframe (#16260) @mroeschke
  • Update libcudf compiler requirements in contributing doc (#16103) @davidwendt
  • Add libcudf public/detail API pattern to developer guide (#16086) @davidwendt
  • Explain line profiler and how to know which functions are GPU-accelerated. (#16079) @bdice
  • cudf.pandas documentation improvement (#15948) @Matt711
  • Reland "Fix docs for IO readers and strings_convert" (#15872)" (#15941) @lithomas1
  • Document how to use cudf.pandas in tandem with multiprocessing (#15940) @wence-
  • DOC: Add documentation for cudf.pandas in the Developer Guide (#15889) @Matt711
  • Improve options docs (#15888) @bdice
  • DOC: add linkcode to docs (#15860) @raybellwaves
  • DOC: use intersphinx mapping in pandas-compat ext (#15846) @raybellwaves
  • Fix inconsistent usage of 'results' and 'records' in read-json.md (#15766) @dagardner-nv
  • Update PandasCompat.py to resolve references (#15704) @raybellwaves

🚀 New Features

  • Creation of CI artifacts for cudf-polars wheels (#16680) @wence-
  • Warn on cuDF failure when POLARS_VERBOSE is true (#16308) @brandon-b-miller
  • Add drop_nulls in cudf-polars (#16290) @brandon-b-miller
  • [JNI] Add setKernelPinnedCopyThreshold and setPinnedAllocationThreshold (#16288) @abellina
  • Implement support for scan_ndjson in cudf-polars (#16263) @lithomas1
  • Publish cudf-polars nightlies (#16213) @lithomas1
  • Modify make_host_vector and make_device_uvector factories to optionally use pinned memory and kernel copy (#16206) @vuule
  • Migrate lists/set_operations to pylibcudf (#16190) @Matt711
  • Migrate lists/filling to pylibcudf (#16189) @Matt711
  • Fall back to CPU for unsupported libcudf binaryops in cudf-polars (#16188) @brandon-b-miller
  • Use resource_ref for upstream in stream_checking_resource_adaptor (#16187) @harrism
  • Migrate lists/modifying to pylibcudf (#16185) @Matt711
  • Migrate lists/filtering to pylibcudf (#16184) @Matt711
  • Migrate lists/sorting to pylibcudf (#16179) @Matt711
  • Add missing methods to lists/list_column_view.pxd in pylibcudf (#16175) @Matt711
  • Migrate pylibcudf lists gathering (#16170) @Matt711
  • Move kernel vis over to CUDF_HIDDEN (#16165) @robertmaynard
  • Add groupby_max multi-threaded benchmark (#16154) @srinivasyadav18
  • Promote has_nested_columns to cudf public API (#16131) @robertmaynard
  • Promote IO support queries to cudf API (#16125) @robertmaynard
  • cudf::merge public API now support passing a user stream (#16124) @robertmaynard
  • Add TPC-H inspired examples for Libcudf (#16088) @JayjeetAtGithub
  • Installed cudf header use cudf::allocate_like (#16087) @robertmaynard
  • cudf-polars string slicing (#16082) @brandon-b-miller
  • Migrate Parquet reader to pylibcudf (#16078) @lithomas1
  • Migrate lists/count_elements to pylibcudf (#16072) @Matt711
  • Migrate lists/extract to pylibcudf (#16071) @Matt711
  • Move common string utilities to public api (#16070) @robertmaynard
  • stable_distinct public api now has a stream parameter (#16068) @robertmaynard
  • Migrate expressions to pylibcudf (#16056) @lithomas1
  • Add support to ArrowDataSource in SourceInfo (#16050) @lithomas1
  • Experimental support for configurable prefetching (#16020) @vyasr
  • Migrate CSV reader to pylibcudf (#16011) @lithomas1
  • Migrate string slice APIs to pylibcudf (#15988) @brandon-b-miller
  • Migrate lists/contains to pylibcudf (#15981) @Matt711
  • Remove CCCL 2.2 patches as we now always use 2.5+ (#15969) @robertmaynard
  • Migrate JSON reader to pylibcudf (#15966) @lithomas1
  • Add a developer check for proxy objects (#15956) @Matt711
  • Start migrating I/O writers to pylibcudf (starting with JSON) (#15952) @lithomas1
  • Kernel copy for pinned memory (#15934) @vuule
  • Migrate left join and conditional join benchmarks to use nvbench (#15931) @srinivasyadav18
  • Migrate lists/combine to pylibcudf (#15928) @Matt711
  • Plumb pylibcudf strings contains_re through cudf_polars (#15918) @brandon-b-miller
  • Start migrating I/O to pylibcudf (#15899) @lithomas1
  • Pinned vector factory that uses the global pool (#15895) @vuule
  • Migrate strings contains operations to pylibcudf (#15880) @brandon-b-miller
  • Migrate quantile.pxd to pylibcudf (#15874) @lithomas1
  • Migrate round to pylibcudf (#15863) @lithomas1
  • Migrate string replace.pxd to pylibcudf (#15839) @lithomas1
  • Add an Environment Variable for debugging the fast path in cudf.pandas (#15837) @Matt711
  • Add an option to run cuIO benchmarks with pinned buffers as input (#15830) @vuule
  • Update pylibcudf testing utilities (#15772) @brandon-b-miller
  • Migrate string capitalize APIs to pylibcudf (#15503) @brandon-b-miller
  • Add tests for pylibcudf binaryops (#15470) @brandon-b-miller
  • Migrate column factories to pylibcudf (#15257) @brandon-b-miller
  • cuDF/libcudf exponentially weighted moving averages (#9027) @brandon-b-miller

πŸ› οΈ Improvements

  • Ensure objects with interface are converted to cupy/numpy arrays (#16436) @mroeschke
  • Add about rmm modes in cudf.pandas docs (#16404) @galipremsagar
  • Gracefully CUDF_FAIL when `skip_rows > 0` in Chunked Parquet reader (#16385) @mhaseeb123
  • Make C++ compilation warning free after #16297 (#16379) @wence-
  • Align Index init APIs with pandas 2.x (#16362) @mroeschke
  • Use rapids_cpm_bs_thread_pool() (#16360) @KyleFromNVIDIA
  • Rename PrefetchConfig to prefetch_config. (#16358) @bdice
  • Implement parquet reading using pylibcudf in cudf-polars (#16346) @lithomas1
  • Fix compile warnings with jni_utils.hpp (#16336) @ttnghia
  • Align Series APIs with pandas 2.x (#16333) @mroeschke
  • Add missing stream param to dictionary factory APIs (#16319) @JayjeetAtGithub
  • Mark cudf._typing as a typing module in ruff (#16318) @mroeschke
  • Add stream param to list explode APIs (#16317) @JayjeetAtGithub
  • Fix polars for 1.2.1 (#16316) @lithomas1
  • Use workflow branch 24.08 again (#16314) @KyleFromNVIDIA
  • Deprecate dtype= parameter in reduction methods (#16313) @mroeschke
  • Remove squeeze argument from groupby (#16312) @mroeschke
  • Align more DataFrame APIs with pandas (#16310) @mroeschke
  • Clean unneeded/redudant dtype utils (#16309) @mroeschke
  • Implement read_csv in cudf-polars using pylibcudf (#16307) @lithomas1
  • Use Column.can_cast_safely instead of some ad-hoc dtype functions in .where (#16303) @mroeschke
  • Drop {{ pin_compatible('numpy', max_pin='x') }} (#16301) @jakirkham
  • Host implementation of to_arrow using nanoarrow (#16297) @zeroshade
  • Add ability to prefetch in cudf.pandas and change default to managed pool (#16296) @galipremsagar
  • Fix tests for polars 1.2 (#16292) @lithomas1
  • Introduce dedicated options for low memory readers (#16289) @galipremsagar
  • Remove decimal/floating 64/128bit switches due to register pressure (#16287) @pmattione-nvidia
  • Make ColumnAccessor strictly require a mapping of columns (#16285) @mroeschke
  • Introduce version file so we can conditionally handle things in tests (#16280) @wence-
  • Type & reduce cupy usage (#16277) @mroeschke
  • Update cudf::detail::grid_1d to use thread_index_type (#16276) @davidwendt
  • Replace np.isscalar/issubdtype checks with is_scalar/.kind checks (#16275) @mroeschke
  • Remove xml from sort_ninja_log.py utility (#16274) @davidwendt
  • Fix issue in horizontal concat implementation in cudf-polars (#16271) @wence-
  • Preserve order in left join for cudf-polars (#16268) @wence-
  • Replace is_datetime/timedelta_dtype checks with .kind checks (#16262) @mroeschke
  • Replace is_float/integer_dtype checks with .kind checks (#16261) @mroeschke
  • Build and test with CUDA 12.5.1 (#16259) @KyleFromNVIDIA
  • Replace isbooltype with checking .dtype.kind (#16255) @mroeschke
  • remove cuco_noexcept.diff (#16254) @trxcllnt
  • Update contains_tests.cpp to use public cudf::slice (#16253) @davidwendt
  • Improve the test data for pylibcudf I/O tests (#16247) @lithomas1
  • Short circuit some Column methods (#16246) @mroeschke
  • Make nvcomp adapter compatible with new version macros (#16245) @vuule
  • Add Column.strftime/strptime instead of overloading as_string/datetime/timedelta_column (#16243) @mroeschke
  • Remove temporary functor overloads required by cuco version bump (#16242) @PointKernel
  • Remove hash_character_ngrams dependency from jaccard_index (#16241) @davidwendt
  • Expose sorted groupby parameters to pylibcudf (#16240) @wence-
  • Expose reflection to check if casting between two types is supported (#16239) @wence-
  • Handle nans in groupby-aggregations in polars executor (#16233) @wence-
  • Remove mr param from write_csv and write_json (#16231) @JayjeetAtGithub
  • Support Literals in groupby-agg (#16218) @wence-
  • Handler csv reader options in cudf-polars (#16211) @wence-
  • Update vendored thread_pool implementation (#16210) @wence-
  • Add low memory JSON reader for cudf.pandas (#16204) @galipremsagar
  • Clean up state variables in MultiIndex (#16203) @mroeschke
  • skip CMake 3.30.0 (#16202) @jameslamb
  • Assert valid metadata is passed in to_arrow for list_view (#16198) @wence-
  • Expose type traits to pylibcudf (#16197) @wence-
  • Report number of rows per file read by PQ reader when no row selection and fix segfault in chunked PQ reader when skip_rows > 0 (#16195) @mhaseeb123
  • Cast count aggs to correct dtype in translation (#16192) @wence-
  • Some small fixes in cudf-polars (#16191) @wence-
  • split up CUDA-suffixed dependencies in dependencies.yaml (#16183) @jameslamb
  • Define PTDS for the stream hook libs (#16182) @trxcllnt
  • Make test_python_cudf_pandas generate requirements.txt (#16181) @trxcllnt
  • Add environment-agnostic ci/run_cudf_polars_pytest.sh (#16178) @trxcllnt
  • Implement translation for some unary functions and a single datetime extraction (#16173) @wence-
  • Remove size constraints on source files in batched JSON reading (#16162) @shrshi
  • CI: Build wheels for cudf-polars (#16156) @lithomas1
  • Update cudf-polars for v1 release of polars (#16149) @wence-
  • Use strings concatenate to support large strings in CSV writer (#16148) @davidwendt
  • Use verify-alpha-spec hook (#16144) @KyleFromNVIDIA
  • Adds write-coalescing code path optimization to FST (#16143) @elstehle
  • MAINT: Adapt to NumPy 2 promotion changes (#16141) @seberg
  • API: Check for integer overflows when creating scalar form python int (#16140) @seberg
  • Remove the (unused) implementation of host_parse_nested_json (#16135) @vuule
  • Deprecate Arrow support in I/O (#16132) @lithomas1
  • Disable dict support for split-page kernel in the parquet reader. (#16128) @nvdbaranec
  • Add throughput metrics for REDUCTION_BENCH/REDUCTION_NVBENCH benchmarks (#16126) @jihoonson
  • Add ensure_index to not unnecessarily shallow copy cudf.Index (#16117) @mroeschke
  • Make binary operators work between fixed-point and floating args (#16116) @pmattione-nvidia
  • Implement Ternary copy_if_else (#16114) @wence-
  • Implement handlers for series literal in cudf-polars (#16113) @wence-
  • Fix dtype errors in StringArrays (#16111) @galipremsagar
  • Ensure MultiIndex.to_frame deep copies columns (#16110) @mroeschke
  • Parallelize gpuInitStringDescriptors for fixed length byte array data (#16109) @mhaseeb123
  • Finish implementation of cudf-polars boolean function handlers (#16098) @wence-
  • Expose and then implement support for cross joins in cudf-polars (#16097) @wence-
  • Defer copying in Column.astype(copy=True) (#16095) @mroeschke
  • Fix segfault in conditional join (#16094) @bdice
  • Free temp memory no longer needed in multibyte_split processing (#16091) @davidwendt
  • Rename gather/scatter benchmarks to clarify coalesced behavior. (#16083) @bdice
  • Adapt to polars upstream changes and turn on CI testing (#16081) @wence-
  • Reduce/clean copy usage in Series, reshaping (#16080) @mroeschke
  • Account for FIXED_LEN_BYTE_ARRAY when calculating fragment sizes in Parquet writer (#16064) @etseidl
  • Reduce (shallow) copies in DataFrame ops (#16060) @mroeschke
  • Add multi-file support to dask_cudf.read_json (#16057) @rjzamora
  • Reduce deep copies in Index ops (#16054) @mroeschke
  • Implement chunked column wise concat in chunked parquet reader (#16052) @galipremsagar
  • Add exception when trying to create large strings with cudf::test::strings_column_wrapper (#16049) @davidwendt
  • Return FrozenList for Index.names (#16047) @galipremsagar
  • Add ast cast test (#16045) @pmattione-nvidia
  • Remove override_dtypes and include_index from Frame._copy_type_metadata (#16043) @mroeschke
  • Add ruff rules to avoid importing from typing (#16040) @mroeschke
  • Fix decimal -> float cast in ast code (#16038) @pmattione-nvidia
  • Add compile option to enable large strings support (#16037) @davidwendt
  • Reduce conditional_join nvbench configurations (#16036) @srinivasyadav18
  • Project automation update: skip if not in project (#16035) @jarmak-nv
  • Add stream parameter to cudf::io::text::multibyte_split (#16034) @davidwendt
  • Delete unused code from stringfunction evaluator (#16032) @wence-
  • Fix exclude regex in pre-commit clang-format hook (#16030) @wence-
  • Refactor rmm usage in cudf.pandas (#16021) @galipremsagar
  • Enable ruff TCH: typing imports under if TYPE_CHECKING (#16015) @mroeschke
  • Restrict the allowed pandas timezone objects in cudf (#16013) @mroeschke
  • orc multithreaded benchmark (#16009) @zpuller
  • Add tests of expression-based sort and sort-by (#16008) @wence-
  • Add tests of implemented StringFunctions (#16007) @wence-
  • Add test that diagonal concat with mismatching schemas raises (#16006) @wence-
  • Add coverage selecting len from a dataframe (number of rows) (#16005) @wence-
  • Add basic tests of dataframe scan (#16003) @wence-
  • Add coverage for both expression and dataframe filter (#16002) @wence-
  • Remove deprecated ExtContext node (#16001) @wence-
  • Fix typo bug in gather implementation (#16000) @wence-
  • Extend coverage of groupby and rolling window nodes (#15999) @wence-
  • Coverage of binops where one or both operands are a scalar (#15998) @wence-
  • Add full coverage for whole-frame Agg expressions (#15997) @wence-
  • Add tests covering magic methods of Expr objects (#15996) @wence-
  • Add full coverage of utility functions (#15995) @wence-
  • Test behaviour of containers (#15994) @wence-
  • Fix implemention of any, all, and is_between (#15993) @wence-
  • Raise early on unhandled PythonScan node (#15992) @wence-
  • Remove mapfunction nodes that don't exist/aren't supported (#15991) @wence-
  • Add test coverage for slicing with "out of bounds" negative indices (#15990) @wence-
  • Standardize and type Series.dt methods (#15987) @mroeschke
  • Refactor distinct with hashset-based algorithms (#15984) @srinivasyadav18
  • resolve dependency-file-generator warning, remove unnecessary rapids-build-backend configuration (#15980) @jameslamb
  • Project automation bug fixes (#15971) @jarmak-nv
  • Add typing to single_column_frame (#15965) @mroeschke
  • Move some misc Frame methods to appropriate locations (#15963) @mroeschke
  • Condense pylibcudf data fixtures (#15958) @lithomas1
  • Refactor fillna logic to push specifics toward Frame subclasses and Column subclasses (#15957) @mroeschke
  • Remove unused parsing utilities (#15955) @vuule
  • Remove Scalar container type from polars interpreter (#15953) @wence-
  • Support arbitrary CUDA versions in UDF code (#15950) @bdice
  • Support large strings in cudf::io::text::multibyte_split (#15947) @davidwendt
  • Add external issue label and project automation (#15945) @jarmak-nv
  • Enable round-tripping of large strings in cudf (#15944) @galipremsagar
  • Add more complete type annotations in polars interpreter (#15942) @wence-
  • Update implementations to build with the latest cuco (#15938) @PointKernel
  • Support timezone aware pandas inputs in cudf (#15935) @mroeschke
  • Define Column.nan_as_null to return self (#15923) @mroeschke
  • Make Frame._dtype an iterator instead of a dict (#15920) @mroeschke
  • Port start of datetime.hpp to pylibcudf (#15916) @wence-
  • Introduce NamedColumn concept in cudf-polars (#15914) @wence-
  • Avoid redefining Frame.getcolumnsbylabel in subclasses (#15912) @mroeschke
  • Templatization of fixed-width parquet decoding kernels. (#15911) @nvdbaranec
  • New Decimal <--> Floating conversion (#15905) @pmattione-nvidia
  • Use Arrow C Data Interface functions for Python interop (#15904) @vyasr
  • Use offsetalator in cudf::io::json::detail::parse_string (#15900) @davidwendt
  • Rename strings multiple target replace API (#15898) @davidwendt
  • Apply clang-tidy autofixes (#15894) @vyasr
  • Update Python labels and remove unnecessary ones (#15893) @vyasr
  • Clean up pylibcudf test assertations (#15892) @lithomas1
  • Use offsetalator in orc rowgroup_char_counts_kernel (#15891) @davidwendt
  • Ensure literals have correct dtype (#15890) @wence-
  • Add overflow check when converting large strings to lists columns (#15887) @davidwendt
  • Use offsetalator in nvtext::tokenize_with_vocabulary (#15878) @davidwendt
  • Update interleave lists column for large strings (#15877) @davidwendt
  • Simple NumPy 2 fixes that are clearly no behavior change (#15876) @seberg
  • Support arrow:schema in Parquet writer to faithfully roundtrip duration types with Arrow (#15875) @mhaseeb123
  • Refactor join benchmarks to target public APIs with the default stream (#15873) @PointKernel
  • Fix url-decode benchmark to use offsetalator (#15871) @davidwendt
  • Use offsetalator in strings shift functor (#15870) @davidwendt
  • Memory Profiling (#15866) @madsbk
  • Expose stream parameter to public rolling APIs (#15865) @srinivasyadav18
  • Make Frame.astype return Self instead of a ColumnAccessor (#15861) @mroeschke
  • Use ColumnAccessor row and column length attributes more consistently (#15857) @mroeschke
  • add unit test setup for cudf_kafka (#15853) @jameslamb
  • Remove internal usage of core.index.as_index in favor of cudf.Index (#15851) @mroeschke
  • Ensure cudf.Series(cudf.Series(...)) creates a reference to the same index (#15845) @mroeschke
  • Remove benchmark-specific use of pinned-pooled memory in Parquet multithreaded benchmark. (#15838) @nvdbaranec
  • Implement on_bad_lines in json reader (#15834) @galipremsagar
  • Make Column.to_pandas return Index instead of Series (#15833) @mroeschke
  • Add test of interoperability of cuDF and arrow BYTE_STREAM_SPLIT encoders (#15832) @etseidl
  • Refactor Parquet writer options and builders (#15831) @etseidl
  • Migrate reshape.pxd to pylibcudf (#15827) @lithomas1
  • Remove legacy JSON reader and concurrent_unordered_map.cuh. (#15813) @bdice
  • Switch cuIO benchmarks to use pinned-pool host allocations by default. (#15805) @nvdbaranec
  • Change thrust::count_if call to raw kernel in strings split APIs (#15762) @davidwendt
  • Improve performance for long strings for nvtext::replace_tokens (#15756) @davidwendt
  • Implement chunked parquet reader in cudf-python (#15728) @galipremsagar
  • Add from_arrow_host functions for cudf interop with nanoarrow (#15645) @zeroshade
  • Add ability to enable rmm pool on cudf.pandas import (#15628) @galipremsagar
  • Executor for polars logical plans (#15504) @wence-
  • Implement dayname and monthname to match pandas (#15479) @btepera
  • Utilities for decimal <--> floating conversion (#15359) @pmattione-nvidia
  • For powers of 10, replace ipow with switch (#15353) @pmattione-nvidia
  • Use rapids-build-backend. (#15245) @vyasr
  • Add codecov coverage for pandas_tests (#14513) @galipremsagar

- C++
Published by rapids-bot[bot] over 1 year ago

https://github.com/rapidsai/cudf - v24.08.03

🚨 Breaking Changes

  • Align Index init APIs with pandas 2.x (#16362) @mroeschke
  • Align Series APIs with pandas 2.x (#16333) @mroeschke
  • Add missing stream param to dictionary factory APIs (#16319) @JayjeetAtGithub
  • Deprecate dtype= parameter in reduction methods (#16313) @mroeschke
  • Remove squeeze argument from groupby (#16312) @mroeschke
  • Align more DataFrame APIs with pandas (#16310) @mroeschke
  • Remove mr param from write_csv and write_json (#16231) @JayjeetAtGithub
  • Report number of rows per file read by PQ reader when no row selection and fix segfault in chunked PQ reader when skip_rows > 0 (#16195) @mhaseeb123
  • Refactor from_arrow_device/host to use resource_ref (#16160) @harrism
  • Deprecate Arrow support in I/O (#16132) @lithomas1
  • Return FrozenList for Index.names (#16047) @galipremsagar
  • Add compile option to enable large strings support (#16037) @davidwendt
  • Hide visibility of non public symbols (#15982) @robertmaynard
  • Rename strings multiple target replace API (#15898) @davidwendt
  • Pinned vector factory that uses the global pool (#15895) @vuule
  • Apply clang-tidy autofixes (#15894) @vyasr
  • Support arrow:schema in Parquet writer to faithfully roundtrip duration types with Arrow (#15875) @mhaseeb123
  • Expose stream parameter to public rolling APIs (#15865) @srinivasyadav18
  • Fix large strings handling in nvtext::character_tokenize (#15829) @davidwendt
  • Remove legacy JSON reader and concurrent_unordered_map.cuh. (#15813) @bdice

πŸ› Bug Fixes

  • Ensure managed memory is supported in cudf.pandas. (#16552) @bdice
  • Add flatbuffers to libcudf build (#16446) @galipremsagar
  • Fix parquet_field_list read_func lambda capture invalid this pointer (#16440) @davidwendt
  • Enable prefetching in cudf.pandas.install() (#16439) @bdice
  • Enable prefetching before runpy (#16427) @galipremsagar
  • Support thread-safe for prefetch_config::get and prefetch_config::set (#16425) @ttnghia
  • Fix a pandas-2.0 missing attribute error (#16416) @galipremsagar
  • [Bug] Remove loud NativeFile deprecation noise for read_parquet from S3 (#16415) @rjzamora
  • Fix nightly memcheck error for empty STREAM_INTEROP_TEST (#16406) @davidwendt
  • Gate ArrowStringArrayNumpySemantics cudf.pandas proxy behind version check (#16401) @mroeschke
  • Don't export bs_thread_pool (#16398) @KyleFromNVIDIA
  • Require fixed width types for casting in cudf-polars (#16381) @brandon-b-miller
  • Fix docstring of DataFrame.apply (#16351) @galipremsagar
  • Make __bool__ raise for more cudf objects (#16311) @mroeschke
  • Rename .devcontainers for CUDA 12.5 (#16293) @jakirkham
  • Fix split_record for all empty strings column (#16291) @davidwendt
  • Fix logic in to_arrow for empty list column (#16279) @wence-
  • [BUG] Make name attr of Index fast slow attrs (#16270) @Matt711
  • Add custom name setter and getter for proxy objects in cudf.pandas (#16234) @Matt711
  • Fall back when casting a timestamp to numeric in cudf-polars (#16232) @brandon-b-miller
  • Disable large string support for Java build (#16216) @jlowe
  • Remove CCCL patch for PR 211. (#16207) @bdice
  • Add single offset to an empty ListArray in cudf::to_arrow (#16201) @davidwendt
  • Fix memory_usage when calculating nested list column (#16193) @mroeschke
  • Support at/iat indexers in cudf.pandas (#16177) @mroeschke
  • Fix unused-return-value debug build error in from_arrow_stream_test.cpp (#16168) @davidwendt
  • Fix cudf::strings::replace_multiple hang on empty target (#16167) @davidwendt
  • Refactor from_arrow_device/host to use resource_ref (#16160) @harrism
  • interpolate returns new column if no values are interpolated (#16158) @mroeschke
  • Use provided memory resource for allocating mixed join results. (#16153) @bdice
  • Run DFG after verify-alpha-spec (#16151) @KyleFromNVIDIA
  • Use size_t to allow large conditional joins (#16127) @bdice
  • Allow only scale=0 fixed-point values in fixed_width_column_wrapper (#16120) @davidwendt
  • Fix pylibcudf Table.num_rows for 0 columns case and add interop to docs (#16108) @lithomas1
  • Add support for proxy np.flatiter objects (#16107) @Matt711
  • Ensure cudf objects can astype to any type when empty (#16106) @mroeschke
  • Support pd.read_pickle and pd.to_pickle in cudf.pandas (#16105) @Matt711
  • Fix unnecessarily strict check in parquet chunked reader for choosing split locations. (#16099) @nvdbaranec
  • Fix is_monotonic_* APIs to include nan's (#16085) @galipremsagar
  • More safely parse CUDA versions when subprocess output is contaminated (#16067) @brandon-b-miller
  • fast_slow_proxy: Don't import assert_eq at top-level (#16063) @wence-
  • Prevent bad ColumnAccessor state after .sort_index(axis=1, ignore_index=True) (#16061) @mroeschke
  • Fix ArrowDeviceArray interface to pass address of event (#16058) @zeroshade
  • Fix a size overflow bug in hash groupby (#16053) @PointKernel
  • Fix atomic_ref scope when multiple blocks are updating the same output (#16051) @vuule
  • Fix initialization error in to_arrow for empty string views (#16033) @wence-
  • Fix the int32 overflow when computing page fragment sizes for large string columns (#16028) @mhaseeb123
  • Fix the pool size alignment issue (#16024) @PointKernel
  • Improve multibyte-split byte-range performance (#16019) @davidwendt
  • Fix target counting in strings char-parallel replace (#16017) @davidwendt
  • Support IntervalDtype in cudf.from_pandas (#16014) @mroeschke
  • Fix memory size in create_byte_range_infos_consecutive (#16012) @davidwendt
  • Hide visibility of non public symbols (#15982) @robertmaynard
  • Fix Cython typo preventing proper inheritance (#15978) @vyasr
  • Fix convert_dtypes with convert_integer=False/convert_floating=True (#15964) @mroeschke
  • Fix nunique for MultiIndex, DataFrame, and all NA case with dropna=False (#15962) @mroeschke
  • Explicitly build for all GPU architectures (#15959) @vyasr
  • Preserve column type and class information in more DataFrame operations (#15949) @mroeschke
  • Add __array_interface__ to cudf.pandas numpy.ndarray proxy (#15936) @mroeschke
  • Allow tests to be built when stream util is disabled (#15933) @robertmaynard
  • Fix JSON multi-source reading when total source size exceeds INT_MAX bytes (#15930) @shrshi
  • Fix dask_cudf.read_parquet regression for legacy timestamp data (#15929) @rjzamora
  • Fix offsetalator when accessing over 268 million rows (#15921) @davidwendt
  • Fix debug assert in rowgroup_char_counts_kernel (#15902) @davidwendt
  • Fix categorical conversion from chunked arrow arrays (#15886) @vyasr
  • Handling for NaN and inf when converting floating point to fixed point types (#15885) @ttnghia
  • Manual merge of Branch 24.08 from 24.06 (#15869) @galipremsagar
  • Avoid unnecessary Index cast in IndexedFrame.index setter (#15843) @charlesbluca
  • Fix large strings handling in nvtext::character_tokenize (#15829) @davidwendt
  • Fix multi-replace target count logic for large strings (#15807) @davidwendt
  • Fix JSON parsing memory corruption - Fix Mixed types nested children removal (#15798) @karthikeyann
  • Allow anonymous user in devcontainer name. (#15784) @bdice
  • Add support for additional metaclasses of proxies and use for ExcelWriter (#15399) @vyasr

📖 Documentation

  • Add docstring for from_dataframe (#16260) @mroeschke
  • Update libcudf compiler requirements in contributing doc (#16103) @davidwendt
  • Add libcudf public/detail API pattern to developer guide (#16086) @davidwendt
  • Explain line profiler and how to know which functions are GPU-accelerated. (#16079) @bdice
  • cudf.pandas documentation improvement (#15948) @Matt711
  • Reland "Fix docs for IO readers and strings_convert" (#15872)" (#15941) @lithomas1
  • Document how to use cudf.pandas in tandem with multiprocessing (#15940) @wence-
  • DOC: Add documentation for cudf.pandas in the Developer Guide (#15889) @Matt711
  • Improve options docs (#15888) @bdice
  • DOC: add linkcode to docs (#15860) @raybellwaves
  • DOC: use intersphinx mapping in pandas-compat ext (#15846) @raybellwaves
  • Fix inconsistent usage of 'results' and 'records' in read-json.md (#15766) @dagardner-nv
  • Update PandasCompat.py to resolve references (#15704) @raybellwaves

🚀 New Features

  • Creation of CI artifacts for cudf-polars wheels (#16680) @wence-
  • Warn on cuDF failure when POLARS_VERBOSE is true (#16308) @brandon-b-miller
  • Add drop_nulls in cudf-polars (#16290) @brandon-b-miller
  • [JNI] Add setKernelPinnedCopyThreshold and setPinnedAllocationThreshold (#16288) @abellina
  • Implement support for scan_ndjson in cudf-polars (#16263) @lithomas1
  • Publish cudf-polars nightlies (#16213) @lithomas1
  • Modify make_host_vector and make_device_uvector factories to optionally use pinned memory and kernel copy (#16206) @vuule
  • Migrate lists/set_operations to pylibcudf (#16190) @Matt711
  • Migrate lists/filling to pylibcudf (#16189) @Matt711
  • Fall back to CPU for unsupported libcudf binaryops in cudf-polars (#16188) @brandon-b-miller
  • Use resource_ref for upstream in stream_checking_resource_adaptor (#16187) @harrism
  • Migrate lists/modifying to pylibcudf (#16185) @Matt711
  • Migrate lists/filtering to pylibcudf (#16184) @Matt711
  • Migrate lists/sorting to pylibcudf (#16179) @Matt711
  • Add missing methods to lists/listcolumnview.pxd in pylibcudf (#16175) @Matt711
  • Migrate pylibcudf lists gathering (#16170) @Matt711
  • Move kernel vis over to CUDF_HIDDEN (#16165) @robertmaynard
  • Add groupby_max multi-threaded benchmark (#16154) @srinivasyadav18
  • Promote has_nested_columns to cudf public API (#16131) @robertmaynard
  • Promote IO support queries to cudf API (#16125) @robertmaynard
  • cudf::merge public API now support passing a user stream (#16124) @robertmaynard
  • Add TPC-H inspired examples for Libcudf (#16088) @JayjeetAtGithub
  • Installed cudf header use cudf::allocate_like (#16087) @robertmaynard
  • cudf-polars string slicing (#16082) @brandon-b-miller
  • Migrate Parquet reader to pylibcudf (#16078) @lithomas1
  • Migrate lists/count_elements to pylibcudf (#16072) @Matt711
  • Migrate lists/extract to pylibcudf (#16071) @Matt711
  • Move common string utilities to public api (#16070) @robertmaynard
  • stable_distinct public api now has a stream parameter (#16068) @robertmaynard
  • Migrate expressions to pylibcudf (#16056) @lithomas1
  • Add support to ArrowDataSource in SourceInfo (#16050) @lithomas1
  • Experimental support for configurable prefetching (#16020) @vyasr
  • Migrate CSV reader to pylibcudf (#16011) @lithomas1
  • Migrate string slice APIs to pylibcudf (#15988) @brandon-b-miller
  • Migrate lists/contains to pylibcudf (#15981) @Matt711
  • Remove CCCL 2.2 patches as we now always use 2.5+ (#15969) @robertmaynard
  • Migrate JSON reader to pylibcudf (#15966) @lithomas1
  • Add a developer check for proxy objects (#15956) @Matt711
  • Start migrating I/O writers to pylibcudf (starting with JSON) (#15952) @lithomas1
  • Kernel copy for pinned memory (#15934) @vuule
  • Migrate left join and conditional join benchmarks to use nvbench (#15931) @srinivasyadav18
  • Migrate lists/combine to pylibcudf (#15928) @Matt711
  • Plumb pylibcudf strings contains_re through cudf_polars (#15918) @brandon-b-miller
  • Start migrating I/O to pylibcudf (#15899) @lithomas1
  • Pinned vector factory that uses the global pool (#15895) @vuule
  • Migrate strings contains operations to pylibcudf (#15880) @brandon-b-miller
  • Migrate quantile.pxd to pylibcudf (#15874) @lithomas1
  • Migrate round to pylibcudf (#15863) @lithomas1
  • Migrate string replace.pxd to pylibcudf (#15839) @lithomas1
  • Add an Environment Variable for debugging the fast path in cudf.pandas (#15837) @Matt711
  • Add an option to run cuIO benchmarks with pinned buffers as input (#15830) @vuule
  • Update pylibcudf testing utilities (#15772) @brandon-b-miller
  • Migrate string capitalize APIs to pylibcudf (#15503) @brandon-b-miller
  • Add tests for pylibcudf binaryops (#15470) @brandon-b-miller
  • Migrate column factories to pylibcudf (#15257) @brandon-b-miller
  • cuDF/libcudf exponentially weighted moving averages (#9027) @brandon-b-miller

🛠️ Improvements

  • Ensure objects with interface are converted to cupy/numpy arrays (#16436) @mroeschke
  • Add about rmm modes in cudf.pandas docs (#16404) @galipremsagar
  • Gracefully CUDF_FAIL when `skip_rows > 0` in Chunked Parquet reader (#16385) @mhaseeb123
  • Make C++ compilation warning free after #16297 (#16379) @wence-
  • Align Index init APIs with pandas 2.x (#16362) @mroeschke
  • Use rapids_cpm_bs_thread_pool() (#16360) @KyleFromNVIDIA
  • Rename PrefetchConfig to prefetch_config. (#16358) @bdice
  • Implement parquet reading using pylibcudf in cudf-polars (#16346) @lithomas1
  • Fix compile warnings with jni_utils.hpp (#16336) @ttnghia
  • Align Series APIs with pandas 2.x (#16333) @mroeschke
  • Add missing stream param to dictionary factory APIs (#16319) @JayjeetAtGithub
  • Mark cudf._typing as a typing module in ruff (#16318) @mroeschke
  • Add stream param to list explode APIs (#16317) @JayjeetAtGithub
  • Fix polars for 1.2.1 (#16316) @lithomas1
  • Use workflow branch 24.08 again (#16314) @KyleFromNVIDIA
  • Deprecate dtype= parameter in reduction methods (#16313) @mroeschke
  • Remove squeeze argument from groupby (#16312) @mroeschke
  • Align more DataFrame APIs with pandas (#16310) @mroeschke
  • Clean unneeded/redudant dtype utils (#16309) @mroeschke
  • Implement read_csv in cudf-polars using pylibcudf (#16307) @lithomas1
  • Use Column.can_cast_safely instead of some ad-hoc dtype functions in .where (#16303) @mroeschke
  • Drop {{ pin_compatible('numpy', max_pin='x') }} (#16301) @jakirkham
  • Host implementation of to_arrow using nanoarrow (#16297) @zeroshade
  • Add ability to prefetch in cudf.pandas and change default to managed pool (#16296) @galipremsagar
  • Fix tests for polars 1.2 (#16292) @lithomas1
  • Introduce dedicated options for low memory readers (#16289) @galipremsagar
  • Remove decimal/floating 64/128bit switches due to register pressure (#16287) @pmattione-nvidia
  • Make ColumnAccessor strictly require a mapping of columns (#16285) @mroeschke
  • Introduce version file so we can conditionally handle things in tests (#16280) @wence-
  • Type & reduce cupy usage (#16277) @mroeschke
  • Update cudf::detail::grid_1d to use thread_index_type (#16276) @davidwendt
  • Replace np.isscalar/issubdtype checks with is_scalar/.kind checks (#16275) @mroeschke
  • Remove xml from sort_ninja_log.py utility (#16274) @davidwendt
  • Fix issue in horizontal concat implementation in cudf-polars (#16271) @wence-
  • Preserve order in left join for cudf-polars (#16268) @wence-
  • Replace is_datetime/timedelta_dtype checks with .kind checks (#16262) @mroeschke
  • Replace is_float/integer_dtype checks with .kind checks (#16261) @mroeschke
  • Build and test with CUDA 12.5.1 (#16259) @KyleFromNVIDIA
  • Replace isbooltype with checking .dtype.kind (#16255) @mroeschke
  • remove cuco_noexcept.diff (#16254) @trxcllnt
  • Update contains_tests.cpp to use public cudf::slice (#16253) @davidwendt
  • Improve the test data for pylibcudf I/O tests (#16247) @lithomas1
  • Short circuit some Column methods (#16246) @mroeschke
  • Make nvcomp adapter compatible with new version macros (#16245) @vuule
  • Add Column.strftime/strptime instead of overloading as_string/datetime/timedelta_column (#16243) @mroeschke
  • Remove temporary functor overloads required by cuco version bump (#16242) @PointKernel
  • Remove hash_character_ngrams dependency from jaccard_index (#16241) @davidwendt
  • Expose sorted groupby parameters to pylibcudf (#16240) @wence-
  • Expose reflection to check if casting between two types is supported (#16239) @wence-
  • Handle nans in groupby-aggregations in polars executor (#16233) @wence-
  • Remove mr param from write_csv and write_json (#16231) @JayjeetAtGithub
  • Support Literals in groupby-agg (#16218) @wence-
  • Handler csv reader options in cudf-polars (#16211) @wence-
  • Update vendored thread_pool implementation (#16210) @wence-
  • Add low memory JSON reader for cudf.pandas (#16204) @galipremsagar
  • Clean up state variables in MultiIndex (#16203) @mroeschke
  • skip CMake 3.30.0 (#16202) @jameslamb
  • Assert valid metadata is passed in toarrow for listview (#16198) @wence-
  • Expose type traits to pylibcudf (#16197) @wence-
  • Report number of rows per file read by PQ reader when no row selection and fix segfault in chunked PQ reader when skip_rows > 0 (#16195) @mhaseeb123
  • Cast count aggs to correct dtype in translation (#16192) @wence-
  • Some small fixes in cudf-polars (#16191) @wence-
  • split up CUDA-suffixed dependencies in dependencies.yaml (#16183) @jameslamb
  • Define PTDS for the stream hook libs (#16182) @trxcllnt
  • Make test_python_cudf_pandas generate requirements.txt (#16181) @trxcllnt
  • Add environment-agnostic ci/run_cudf_polars_pytest.sh (#16178) @trxcllnt
  • Implement translation for some unary functions and a single datetime extraction (#16173) @wence-
  • Remove size constraints on source files in batched JSON reading (#16162) @shrshi
  • CI: Build wheels for cudf-polars (#16156) @lithomas1
  • Update cudf-polars for v1 release of polars (#16149) @wence-
  • Use strings concatenate to support large strings in CSV writer (#16148) @davidwendt
  • Use verify-alpha-spec hook (#16144) @KyleFromNVIDIA
  • Adds write-coalescing code path optimization to FST (#16143) @elstehle
  • MAINT: Adapt to NumPy 2 promotion changes (#16141) @seberg
  • API: Check for integer overflows when creating scalar form python int (#16140) @seberg
  • Remove the (unused) implementation of host_parse_nested_json (#16135) @vuule
  • Deprecate Arrow support in I/O (#16132) @lithomas1
  • Disable dict support for split-page kernel in the parquet reader. (#16128) @nvdbaranec
  • Add throughput metrics for REDUCTION_BENCH/REDUCTION_NVBENCH benchmarks (#16126) @jihoonson
  • Add ensure_index to not unnecessarily shallow copy cudf.Index (#16117) @mroeschke
  • Make binary operators work between fixed-point and floating args (#16116) @pmattione-nvidia
  • Implement Ternary copy_if_else (#16114) @wence-
  • Implement handlers for series literal in cudf-polars (#16113) @wence-
  • Fix dtype errors in StringArrays (#16111) @galipremsagar
  • Ensure MultiIndex.to_frame deep copies columns (#16110) @mroeschke
  • Parallelize gpuInitStringDescriptors for fixed length byte array data (#16109) @mhaseeb123
  • Finish implementation of cudf-polars boolean function handlers (#16098) @wence-
  • Expose and then implement support for cross joins in cudf-polars (#16097) @wence-
  • Defer copying in Column.astype(copy=True) (#16095) @mroeschke
  • Fix segfault in conditional join (#16094) @bdice
  • Free temp memory no longer needed in multibyte_split processing (#16091) @davidwendt
  • Rename gather/scatter benchmarks to clarify coalesced behavior. (#16083) @bdice
  • Adapt to polars upstream changes and turn on CI testing (#16081) @wence-
  • Reduce/clean copy usage in Series, reshaping (#16080) @mroeschke
  • Account for FIXED_LEN_BYTE_ARRAY when calculating fragment sizes in Parquet writer (#16064) @etseidl
  • Reduce (shallow) copies in DataFrame ops (#16060) @mroeschke
  • Add multi-file support to dask_cudf.read_json (#16057) @rjzamora
  • Reduce deep copies in Index ops (#16054) @mroeschke
  • Implement chunked column wise concat in chunked parquet reader (#16052) @galipremsagar
  • Add exception when trying to create large strings with cudf::test::strings_column_wrapper (#16049) @davidwendt
  • Return FrozenList for Index.names (#16047) @galipremsagar
  • Add ast cast test (#16045) @pmattione-nvidia
  • Remove override_dtypes and include_index from Frame._copy_type_metadata (#16043) @mroeschke
  • Add ruff rules to avoid importing from typing (#16040) @mroeschke
  • Fix decimal -> float cast in ast code (#16038) @pmattione-nvidia
  • Add compile option to enable large strings support (#16037) @davidwendt
  • Reduce conditional_join nvbench configurations (#16036) @srinivasyadav18
  • Project automation update: skip if not in project (#16035) @jarmak-nv
  • Add stream parameter to cudf::io::text::multibyte_split (#16034) @davidwendt
  • Delete unused code from stringfunction evaluator (#16032) @wence-
  • Fix exclude regex in pre-commit clang-format hook (#16030) @wence-
  • Refactor rmm usage in cudf.pandas (#16021) @galipremsagar
  • Enable ruff TCH: typing imports under if TYPE_CHECKING (#16015) @mroeschke
  • Restrict the allowed pandas timezone objects in cudf (#16013) @mroeschke
  • orc multithreaded benchmark (#16009) @zpuller
  • Add tests of expression-based sort and sort-by (#16008) @wence-
  • Add tests of implemented StringFunctions (#16007) @wence-
  • Add test that diagonal concat with mismatching schemas raises (#16006) @wence-
  • Add coverage selecting len from a dataframe (number of rows) (#16005) @wence-
  • Add basic tests of dataframe scan (#16003) @wence-
  • Add coverage for both expression and dataframe filter (#16002) @wence-
  • Remove deprecated ExtContext node (#16001) @wence-
  • Fix typo bug in gather implementation (#16000) @wence-
  • Extend coverage of groupby and rolling window nodes (#15999) @wence-
  • Coverage of binops where one or both operands are a scalar (#15998) @wence-
  • Add full coverage for whole-frame Agg expressions (#15997) @wence-
  • Add tests covering magic methods of Expr objects (#15996) @wence-
  • Add full coverage of utility functions (#15995) @wence-
  • Test behaviour of containers (#15994) @wence-
  • Fix implementation of any, all, and is_between (#15993) @wence-
  • Raise early on unhandled PythonScan node (#15992) @wence-
  • Remove mapfunction nodes that don't exist/aren't supported (#15991) @wence-
  • Add test coverage for slicing with "out of bounds" negative indices (#15990) @wence-
  • Standardize and type Series.dt methods (#15987) @mroeschke
  • Refactor distinct with hashset-based algorithms (#15984) @srinivasyadav18
  • resolve dependency-file-generator warning, remove unnecessary rapids-build-backend configuration (#15980) @jameslamb
  • Project automation bug fixes (#15971) @jarmak-nv
  • Add typing to singlecolumnframe (#15965) @mroeschke
  • Move some misc Frame methods to appropriate locations (#15963) @mroeschke
  • Condense pylibcudf data fixtures (#15958) @lithomas1
  • Refactor fillna logic to push specifics toward Frame subclasses and Column subclasses (#15957) @mroeschke
  • Remove unused parsing utilities (#15955) @vuule
  • Remove Scalar container type from polars interpreter (#15953) @wence-
  • Support arbitrary CUDA versions in UDF code (#15950) @bdice
  • Support large strings in cudf::io::text::multibyte_split (#15947) @davidwendt
  • Add external issue label and project automation (#15945) @jarmak-nv
  • Enable round-tripping of large strings in cudf (#15944) @galipremsagar
  • Add more complete type annotations in polars interpreter (#15942) @wence-
  • Update implementations to build with the latest cuco (#15938) @PointKernel
  • Support timezone aware pandas inputs in cudf (#15935) @mroeschke
  • Define Column.nan_as_null to return self (#15923) @mroeschke
  • Make Frame._dtype an iterator instead of a dict (#15920) @mroeschke
  • Port start of datetime.hpp to pylibcudf (#15916) @wence-
  • Introduce NamedColumn concept in cudf-polars (#15914) @wence-
  • Avoid redefining Frame.getcolumnsbylabel in subclasses (#15912) @mroeschke
  • Templatization of fixed-width parquet decoding kernels. (#15911) @nvdbaranec
  • New Decimal <--> Floating conversion (#15905) @pmattione-nvidia
  • Use Arrow C Data Interface functions for Python interop (#15904) @vyasr
  • Use offsetalator in cudf::io::json::detail::parse_string (#15900) @davidwendt
  • Rename strings multiple target replace API (#15898) @davidwendt
  • Apply clang-tidy autofixes (#15894) @vyasr
  • Update Python labels and remove unnecessary ones (#15893) @vyasr
  • Clean up pylibcudf test assertations (#15892) @lithomas1
  • Use offsetalator in orc rowgroup_char_counts_kernel (#15891) @davidwendt
  • Ensure literals have correct dtype (#15890) @wence-
  • Add overflow check when converting large strings to lists columns (#15887) @davidwendt
  • Use offsetalator in nvtext::tokenize_with_vocabulary (#15878) @davidwendt
  • Update interleave lists column for large strings (#15877) @davidwendt
  • Simple NumPy 2 fixes that are clearly no behavior change (#15876) @seberg
  • Support arrow:schema in Parquet writer to faithfully roundtrip duration types with Arrow (#15875) @mhaseeb123
  • Refactor join benchmarks to target public APIs with the default stream (#15873) @PointKernel
  • Fix url-decode benchmark to use offsetalator (#15871) @davidwendt
  • Use offsetalator in strings shift functor (#15870) @davidwendt
  • Memory Profiling (#15866) @madsbk
  • Expose stream parameter to public rolling APIs (#15865) @srinivasyadav18
  • Make Frame.astype return Self instead of a ColumnAccessor (#15861) @mroeschke
  • Use ColumnAccessor row and column length attributes more consistently (#15857) @mroeschke
  • add unit test setup for cudf_kafka (#15853) @jameslamb
  • Remove internal usage of core.index.as_index in favor of cudf.Index (#15851) @mroeschke
  • Ensure cudf.Series(cudf.Series(...)) creates a reference to the same index (#15845) @mroeschke
  • Remove benchmark-specific use of pinned-pooled memory in Parquet multithreaded benchmark. (#15838) @nvdbaranec
  • Implement on_bad_lines in json reader (#15834) @galipremsagar
  • Make Column.to_pandas return Index instead of Series (#15833) @mroeschke
  • Add test of interoperability of cuDF and arrow BYTE_STREAM_SPLIT encoders (#15832) @etseidl
  • Refactor Parquet writer options and builders (#15831) @etseidl
  • Migrate reshape.pxd to pylibcudf (#15827) @lithomas1
  • Remove legacy JSON reader and concurrent_unordered_map.cuh. (#15813) @bdice
  • Switch cuIO benchmarks to use pinned-pool host allocations by default. (#15805) @nvdbaranec
  • Change thrust::count_if call to raw kernel in strings split APIs (#15762) @davidwendt
  • Improve performance for long strings for nvtext::replace_tokens (#15756) @davidwendt
  • Implement chunked parquet reader in cudf-python (#15728) @galipremsagar
  • Add from_arrow_host functions for cudf interop with nanoarrow (#15645) @zeroshade
  • Add ability to enable rmm pool on cudf.pandas import (#15628) @galipremsagar
  • Executor for polars logical plans (#15504) @wence-
  • Implement dayname and monthname to match pandas (#15479) @btepera
  • Utilities for decimal <--> floating conversion (#15359) @pmattione-nvidia
  • For powers of 10, replace ipow with switch (#15353) @pmattione-nvidia
  • Use rapids-build-backend. (#15245) @vyasr
  • Add codecov coverage for pandas_tests (#14513) @galipremsagar

- C++
Published by raydouglass over 1 year ago

https://github.com/rapidsai/cudf - v24.08.02

🚨 Breaking Changes

  • Align Index init APIs with pandas 2.x (#16362) @mroeschke
  • Align Series APIs with pandas 2.x (#16333) @mroeschke
  • Add missing stream param to dictionary factory APIs (#16319) @JayjeetAtGithub
  • Deprecate dtype= parameter in reduction methods (#16313) @mroeschke
  • Remove squeeze argument from groupby (#16312) @mroeschke
  • Align more DataFrame APIs with pandas (#16310) @mroeschke
  • Remove mr param from write_csv and write_json (#16231) @JayjeetAtGithub
  • Report number of rows per file read by PQ reader when no row selection and fix segfault in chunked PQ reader when skip_rows > 0 (#16195) @mhaseeb123
  • Refactor from_arrow_device/host to use resource_ref (#16160) @harrism
  • Deprecate Arrow support in I/O (#16132) @lithomas1
  • Return FrozenList for Index.names (#16047) @galipremsagar
  • Add compile option to enable large strings support (#16037) @davidwendt
  • Hide visibility of non public symbols (#15982) @robertmaynard
  • Rename strings multiple target replace API (#15898) @davidwendt
  • Pinned vector factory that uses the global pool (#15895) @vuule
  • Apply clang-tidy autofixes (#15894) @vyasr
  • Support arrow:schema in Parquet writer to faithfully roundtrip duration types with Arrow (#15875) @mhaseeb123
  • Expose stream parameter to public rolling APIs (#15865) @srinivasyadav18
  • Fix large strings handling in nvtext::character_tokenize (#15829) @davidwendt
  • Remove legacy JSON reader and concurrent_unordered_map.cuh. (#15813) @bdice

🐛 Bug Fixes

  • Ensure managed memory is supported in cudf.pandas. (#16552) @bdice
  • Add flatbuffers to libcudf build (#16446) @galipremsagar
  • Fix parquet_field_list read_func lambda capture invalid this pointer (#16440) @davidwendt
  • Enable prefetching in cudf.pandas.install() (#16439) @bdice
  • Enable prefetching before runpy (#16427) @galipremsagar
  • Support thread-safe for prefetch_config::get and prefetch_config::set (#16425) @ttnghia
  • Fix a pandas-2.0 missing attribute error (#16416) @galipremsagar
  • [Bug] Remove loud NativeFile deprecation noise for read_parquet from S3 (#16415) @rjzamora
  • Fix nightly memcheck error for empty STREAM_INTEROP_TEST (#16406) @davidwendt
  • Gate ArrowStringArrayNumpySemantics cudf.pandas proxy behind version check (#16401) @mroeschke
  • Don't export bs_thread_pool (#16398) @KyleFromNVIDIA
  • Require fixed width types for casting in cudf-polars (#16381) @brandon-b-miller
  • Fix docstring of DataFrame.apply (#16351) @galipremsagar
  • Make __bool__ raise for more cudf objects (#16311) @mroeschke
  • Rename .devcontainers for CUDA 12.5 (#16293) @jakirkham
  • Fix split_record for all empty strings column (#16291) @davidwendt
  • Fix logic in to_arrow for empty list column (#16279) @wence-
  • [BUG] Make name attr of Index fast slow attrs (#16270) @Matt711
  • Add custom name setter and getter for proxy objects in cudf.pandas (#16234) @Matt711
  • Fall back when casting a timestamp to numeric in cudf-polars (#16232) @brandon-b-miller
  • Disable large string support for Java build (#16216) @jlowe
  • Remove CCCL patch for PR 211. (#16207) @bdice
  • Add single offset to an empty ListArray in cudf::to_arrow (#16201) @davidwendt
  • Fix memory_usage when calculating nested list column (#16193) @mroeschke
  • Support at/iat indexers in cudf.pandas (#16177) @mroeschke
  • Fix unused-return-value debug build error in from_arrow_stream_test.cpp (#16168) @davidwendt
  • Fix cudf::strings::replace_multiple hang on empty target (#16167) @davidwendt
  • Refactor from_arrow_device/host to use resource_ref (#16160) @harrism
  • interpolate returns new column if no values are interpolated (#16158) @mroeschke
  • Use provided memory resource for allocating mixed join results. (#16153) @bdice
  • Run DFG after verify-alpha-spec (#16151) @KyleFromNVIDIA
  • Use size_t to allow large conditional joins (#16127) @bdice
  • Allow only scale=0 fixed-point values in fixed_width_column_wrapper (#16120) @davidwendt
  • Fix pylibcudf Table.num_rows for 0 columns case and add interop to docs (#16108) @lithomas1
  • Add support for proxy np.flatiter objects (#16107) @Matt711
  • Ensure cudf objects can astype to any type when empty (#16106) @mroeschke
  • Support pd.read_pickle and pd.to_pickle in cudf.pandas (#16105) @Matt711
  • Fix unnecessarily strict check in parquet chunked reader for choosing split locations. (#16099) @nvdbaranec
  • Fix is_monotonic_* APIs to include nan's (#16085) @galipremsagar
  • More safely parse CUDA versions when subprocess output is contaminated (#16067) @brandon-b-miller
  • fast_slow_proxy: Don't import assert_eq at top-level (#16063) @wence-
  • Prevent bad ColumnAccessor state after .sort_index(axis=1, ignore_index=True) (#16061) @mroeschke
  • Fix ArrowDeviceArray interface to pass address of event (#16058) @zeroshade
  • Fix a size overflow bug in hash groupby (#16053) @PointKernel
  • Fix atomic_ref scope when multiple blocks are updating the same output (#16051) @vuule
  • Fix initialization error in to_arrow for empty string views (#16033) @wence-
  • Fix the int32 overflow when computing page fragment sizes for large string columns (#16028) @mhaseeb123
  • Fix the pool size alignment issue (#16024) @PointKernel
  • Improve multibyte-split byte-range performance (#16019) @davidwendt
  • Fix target counting in strings char-parallel replace (#16017) @davidwendt
  • Support IntervalDtype in cudf.from_pandas (#16014) @mroeschke
  • Fix memory size in create_byte_range_infos_consecutive (#16012) @davidwendt
  • Hide visibility of non public symbols (#15982) @robertmaynard
  • Fix Cython typo preventing proper inheritance (#15978) @vyasr
  • Fix convert_dtypes with convert_integer=False/convert_floating=True (#15964) @mroeschke
  • Fix nunique for MultiIndex, DataFrame, and all NA case with dropna=False (#15962) @mroeschke
  • Explicitly build for all GPU architectures (#15959) @vyasr
  • Preserve column type and class information in more DataFrame operations (#15949) @mroeschke
  • Add __array_interface__ to cudf.pandas numpy.ndarray proxy (#15936) @mroeschke
  • Allow tests to be built when stream util is disabled (#15933) @robertmaynard
  • Fix JSON multi-source reading when total source size exceeds INT_MAX bytes (#15930) @shrshi
  • Fix dask_cudf.read_parquet regression for legacy timestamp data (#15929) @rjzamora
  • Fix offsetalator when accessing over 268 million rows (#15921) @davidwendt
  • Fix debug assert in rowgroup_char_counts_kernel (#15902) @davidwendt
  • Fix categorical conversion from chunked arrow arrays (#15886) @vyasr
  • Handling for NaN and inf when converting floating point to fixed point types (#15885) @ttnghia
  • Manual merge of Branch 24.08 from 24.06 (#15869) @galipremsagar
  • Avoid unnecessary Index cast in IndexedFrame.index setter (#15843) @charlesbluca
  • Fix large strings handling in nvtext::character_tokenize (#15829) @davidwendt
  • Fix multi-replace target count logic for large strings (#15807) @davidwendt
  • Fix JSON parsing memory corruption - Fix Mixed types nested children removal (#15798) @karthikeyann
  • Allow anonymous user in devcontainer name. (#15784) @bdice
  • Add support for additional metaclasses of proxies and use for ExcelWriter (#15399) @vyasr

📖 Documentation

  • Add docstring for from_dataframe (#16260) @mroeschke
  • Update libcudf compiler requirements in contributing doc (#16103) @davidwendt
  • Add libcudf public/detail API pattern to developer guide (#16086) @davidwendt
  • Explain line profiler and how to know which functions are GPU-accelerated. (#16079) @bdice
  • cudf.pandas documentation improvement (#15948) @Matt711
  • Reland "Fix docs for IO readers and strings_convert" (#15872)" (#15941) @lithomas1
  • Document how to use cudf.pandas in tandem with multiprocessing (#15940) @wence-
  • DOC: Add documentation for cudf.pandas in the Developer Guide (#15889) @Matt711
  • Improve options docs (#15888) @bdice
  • DOC: add linkcode to docs (#15860) @raybellwaves
  • DOC: use intersphinx mapping in pandas-compat ext (#15846) @raybellwaves
  • Fix inconsistent usage of 'results' and 'records' in read-json.md (#15766) @dagardner-nv
  • Update PandasCompat.py to resolve references (#15704) @raybellwaves

🚀 New Features

  • Warn on cuDF failure when POLARS_VERBOSE is true (#16308) @brandon-b-miller
  • Add drop_nulls in cudf-polars (#16290) @brandon-b-miller
  • [JNI] Add setKernelPinnedCopyThreshold and setPinnedAllocationThreshold (#16288) @abellina
  • Implement support for scan_ndjson in cudf-polars (#16263) @lithomas1
  • Publish cudf-polars nightlies (#16213) @lithomas1
  • Modify make_host_vector and make_device_uvector factories to optionally use pinned memory and kernel copy (#16206) @vuule
  • Migrate lists/set_operations to pylibcudf (#16190) @Matt711
  • Migrate lists/filling to pylibcudf (#16189) @Matt711
  • Fall back to CPU for unsupported libcudf binaryops in cudf-polars (#16188) @brandon-b-miller
  • Use resource_ref for upstream in stream_checking_resource_adaptor (#16187) @harrism
  • Migrate lists/modifying to pylibcudf (#16185) @Matt711
  • Migrate lists/filtering to pylibcudf (#16184) @Matt711
  • Migrate lists/sorting to pylibcudf (#16179) @Matt711
  • Add missing methods to lists/listcolumnview.pxd in pylibcudf (#16175) @Matt711
  • Migrate pylibcudf lists gathering (#16170) @Matt711
  • Move kernel vis over to CUDF_HIDDEN (#16165) @robertmaynard
  • Add groupby_max multi-threaded benchmark (#16154) @srinivasyadav18
  • Promote has_nested_columns to cudf public API (#16131) @robertmaynard
  • Promote IO support queries to cudf API (#16125) @robertmaynard
  • cudf::merge public API now support passing a user stream (#16124) @robertmaynard
  • Add TPC-H inspired examples for Libcudf (#16088) @JayjeetAtGithub
  • Installed cudf header use cudf::allocate_like (#16087) @robertmaynard
  • cudf-polars string slicing (#16082) @brandon-b-miller
  • Migrate Parquet reader to pylibcudf (#16078) @lithomas1
  • Migrate lists/count_elements to pylibcudf (#16072) @Matt711
  • Migrate lists/extract to pylibcudf (#16071) @Matt711
  • Move common string utilities to public api (#16070) @robertmaynard
  • stable_distinct public api now has a stream parameter (#16068) @robertmaynard
  • Migrate expressions to pylibcudf (#16056) @lithomas1
  • Add support to ArrowDataSource in SourceInfo (#16050) @lithomas1
  • Experimental support for configurable prefetching (#16020) @vyasr
  • Migrate CSV reader to pylibcudf (#16011) @lithomas1
  • Migrate string slice APIs to pylibcudf (#15988) @brandon-b-miller
  • Migrate lists/contains to pylibcudf (#15981) @Matt711
  • Remove CCCL 2.2 patches as we now always use 2.5+ (#15969) @robertmaynard
  • Migrate JSON reader to pylibcudf (#15966) @lithomas1
  • Add a developer check for proxy objects (#15956) @Matt711
  • Start migrating I/O writers to pylibcudf (starting with JSON) (#15952) @lithomas1
  • Kernel copy for pinned memory (#15934) @vuule
  • Migrate left join and conditional join benchmarks to use nvbench (#15931) @srinivasyadav18
  • Migrate lists/combine to pylibcudf (#15928) @Matt711
  • Plumb pylibcudf strings contains_re through cudf_polars (#15918) @brandon-b-miller
  • Start migrating I/O to pylibcudf (#15899) @lithomas1
  • Pinned vector factory that uses the global pool (#15895) @vuule
  • Migrate strings contains operations to pylibcudf (#15880) @brandon-b-miller
  • Migrate quantile.pxd to pylibcudf (#15874) @lithomas1
  • Migrate round to pylibcudf (#15863) @lithomas1
  • Migrate string replace.pxd to pylibcudf (#15839) @lithomas1
  • Add an Environment Variable for debugging the fast path in cudf.pandas (#15837) @Matt711
  • Add an option to run cuIO benchmarks with pinned buffers as input (#15830) @vuule
  • Update pylibcudf testing utilities (#15772) @brandon-b-miller
  • Migrate string capitalize APIs to pylibcudf (#15503) @brandon-b-miller
  • Add tests for pylibcudf binaryops (#15470) @brandon-b-miller
  • Migrate column factories to pylibcudf (#15257) @brandon-b-miller
  • cuDF/libcudf exponentially weighted moving averages (#9027) @brandon-b-miller

🛠️ Improvements

  • Ensure objects with interface are converted to cupy/numpy arrays (#16436) @mroeschke
  • Add about rmm modes in cudf.pandas docs (#16404) @galipremsagar
  • Gracefully CUDF_FAIL when `skip_rows > 0` in Chunked Parquet reader (#16385) @mhaseeb123
  • Make C++ compilation warning free after #16297 (#16379) @wence-
  • Align Index init APIs with pandas 2.x (#16362) @mroeschke
  • Use rapids_cpm_bs_thread_pool() (#16360) @KyleFromNVIDIA
  • Rename PrefetchConfig to prefetch_config. (#16358) @bdice
  • Implement parquet reading using pylibcudf in cudf-polars (#16346) @lithomas1
  • Fix compile warnings with jni_utils.hpp (#16336) @ttnghia
  • Align Series APIs with pandas 2.x (#16333) @mroeschke
  • Add missing stream param to dictionary factory APIs (#16319) @JayjeetAtGithub
  • Mark cudf._typing as a typing module in ruff (#16318) @mroeschke
  • Add stream param to list explode APIs (#16317) @JayjeetAtGithub
  • Fix polars for 1.2.1 (#16316) @lithomas1
  • Use workflow branch 24.08 again (#16314) @KyleFromNVIDIA
  • Deprecate dtype= parameter in reduction methods (#16313) @mroeschke
  • Remove squeeze argument from groupby (#16312) @mroeschke
  • Align more DataFrame APIs with pandas (#16310) @mroeschke
  • Clean unneeded/redundant dtype utils (#16309) @mroeschke
  • Implement read_csv in cudf-polars using pylibcudf (#16307) @lithomas1
  • Use Column.can_cast_safely instead of some ad-hoc dtype functions in .where (#16303) @mroeschke
  • Drop {{ pin_compatible('numpy', max_pin='x') }} (#16301) @jakirkham
  • Host implementation of to_arrow using nanoarrow (#16297) @zeroshade
  • Add ability to prefetch in cudf.pandas and change default to managed pool (#16296) @galipremsagar
  • Fix tests for polars 1.2 (#16292) @lithomas1
  • Introduce dedicated options for low memory readers (#16289) @galipremsagar
  • Remove decimal/floating 64/128bit switches due to register pressure (#16287) @pmattione-nvidia
  • Make ColumnAccessor strictly require a mapping of columns (#16285) @mroeschke
  • Introduce version file so we can conditionally handle things in tests (#16280) @wence-
  • Type & reduce cupy usage (#16277) @mroeschke
  • Update cudf::detail::grid_1d to use thread_index_type (#16276) @davidwendt
  • Replace np.isscalar/issubdtype checks with is_scalar/.kind checks (#16275) @mroeschke
  • Remove xml from sort_ninja_log.py utility (#16274) @davidwendt
  • Fix issue in horizontal concat implementation in cudf-polars (#16271) @wence-
  • Preserve order in left join for cudf-polars (#16268) @wence-
  • Replace is_datetime/timedelta_dtype checks with .kind checks (#16262) @mroeschke
  • Replace is_float/integer_dtype checks with .kind checks (#16261) @mroeschke
  • Build and test with CUDA 12.5.1 (#16259) @KyleFromNVIDIA
  • Replace is_bool_type with checking .dtype.kind (#16255) @mroeschke
  • remove cuco_noexcept.diff (#16254) @trxcllnt
  • Update contains_tests.cpp to use public cudf::slice (#16253) @davidwendt
  • Improve the test data for pylibcudf I/O tests (#16247) @lithomas1
  • Short circuit some Column methods (#16246) @mroeschke
  • Make nvcomp adapter compatible with new version macros (#16245) @vuule
  • Add Column.strftime/strptime instead of overloading as_string/datetime/timedelta_column (#16243) @mroeschke
  • Remove temporary functor overloads required by cuco version bump (#16242) @PointKernel
  • Remove hash_character_ngrams dependency from jaccard_index (#16241) @davidwendt
  • Expose sorted groupby parameters to pylibcudf (#16240) @wence-
  • Expose reflection to check if casting between two types is supported (#16239) @wence-
  • Handle nans in groupby-aggregations in polars executor (#16233) @wence-
  • Remove mr param from write_csv and write_json (#16231) @JayjeetAtGithub
  • Support Literals in groupby-agg (#16218) @wence-
  • Handle csv reader options in cudf-polars (#16211) @wence-
  • Update vendored thread_pool implementation (#16210) @wence-
  • Add low memory JSON reader for cudf.pandas (#16204) @galipremsagar
  • Clean up state variables in MultiIndex (#16203) @mroeschke
  • skip CMake 3.30.0 (#16202) @jameslamb
  • Assert valid metadata is passed in to_arrow for list_view (#16198) @wence-
  • Expose type traits to pylibcudf (#16197) @wence-
  • Report number of rows per file read by PQ reader when no row selection and fix segfault in chunked PQ reader when skip_rows > 0 (#16195) @mhaseeb123
  • Cast count aggs to correct dtype in translation (#16192) @wence-
  • Some small fixes in cudf-polars (#16191) @wence-
  • split up CUDA-suffixed dependencies in dependencies.yaml (#16183) @jameslamb
  • Define PTDS for the stream hook libs (#16182) @trxcllnt
  • Make test_python_cudf_pandas generate requirements.txt (#16181) @trxcllnt
  • Add environment-agnostic ci/run_cudf_polars_pytest.sh (#16178) @trxcllnt
  • Implement translation for some unary functions and a single datetime extraction (#16173) @wence-
  • Remove size constraints on source files in batched JSON reading (#16162) @shrshi
  • CI: Build wheels for cudf-polars (#16156) @lithomas1
  • Update cudf-polars for v1 release of polars (#16149) @wence-
  • Use strings concatenate to support large strings in CSV writer (#16148) @davidwendt
  • Use verify-alpha-spec hook (#16144) @KyleFromNVIDIA
  • Adds write-coalescing code path optimization to FST (#16143) @elstehle
  • MAINT: Adapt to NumPy 2 promotion changes (#16141) @seberg
  • API: Check for integer overflows when creating scalar from python int (#16140) @seberg
  • Remove the (unused) implementation of host_parse_nested_json (#16135) @vuule
  • Deprecate Arrow support in I/O (#16132) @lithomas1
  • Disable dict support for split-page kernel in the parquet reader. (#16128) @nvdbaranec
  • Add throughput metrics for REDUCTION_BENCH/REDUCTION_NVBENCH benchmarks (#16126) @jihoonson
  • Add ensure_index to not unnecessarily shallow copy cudf.Index (#16117) @mroeschke
  • Make binary operators work between fixed-point and floating args (#16116) @pmattione-nvidia
  • Implement Ternary copy_if_else (#16114) @wence-
  • Implement handlers for series literal in cudf-polars (#16113) @wence-
  • Fix dtype errors in StringArrays (#16111) @galipremsagar
  • Ensure MultiIndex.to_frame deep copies columns (#16110) @mroeschke
  • Parallelize gpuInitStringDescriptors for fixed length byte array data (#16109) @mhaseeb123
  • Finish implementation of cudf-polars boolean function handlers (#16098) @wence-
  • Expose and then implement support for cross joins in cudf-polars (#16097) @wence-
  • Defer copying in Column.astype(copy=True) (#16095) @mroeschke
  • Fix segfault in conditional join (#16094) @bdice
  • Free temp memory no longer needed in multibyte_split processing (#16091) @davidwendt
  • Rename gather/scatter benchmarks to clarify coalesced behavior. (#16083) @bdice
  • Adapt to polars upstream changes and turn on CI testing (#16081) @wence-
  • Reduce/clean copy usage in Series, reshaping (#16080) @mroeschke
  • Account for FIXED_LEN_BYTE_ARRAY when calculating fragment sizes in Parquet writer (#16064) @etseidl
  • Reduce (shallow) copies in DataFrame ops (#16060) @mroeschke
  • Add multi-file support to dask_cudf.read_json (#16057) @rjzamora
  • Reduce deep copies in Index ops (#16054) @mroeschke
  • Implement chunked column wise concat in chunked parquet reader (#16052) @galipremsagar
  • Add exception when trying to create large strings with cudf::test::strings_column_wrapper (#16049) @davidwendt
  • Return FrozenList for Index.names (#16047) @galipremsagar
  • Add ast cast test (#16045) @pmattione-nvidia
  • Remove override_dtypes and include_index from Frame._copy_type_metadata (#16043) @mroeschke
  • Add ruff rules to avoid importing from typing (#16040) @mroeschke
  • Fix decimal -> float cast in ast code (#16038) @pmattione-nvidia
  • Add compile option to enable large strings support (#16037) @davidwendt
  • Reduce conditional_join nvbench configurations (#16036) @srinivasyadav18
  • Project automation update: skip if not in project (#16035) @jarmak-nv
  • Add stream parameter to cudf::io::text::multibyte_split (#16034) @davidwendt
  • Delete unused code from stringfunction evaluator (#16032) @wence-
  • Fix exclude regex in pre-commit clang-format hook (#16030) @wence-
  • Refactor rmm usage in cudf.pandas (#16021) @galipremsagar
  • Enable ruff TCH: typing imports under if TYPE_CHECKING (#16015) @mroeschke
  • Restrict the allowed pandas timezone objects in cudf (#16013) @mroeschke
  • orc multithreaded benchmark (#16009) @zpuller
  • Add tests of expression-based sort and sort-by (#16008) @wence-
  • Add tests of implemented StringFunctions (#16007) @wence-
  • Add test that diagonal concat with mismatching schemas raises (#16006) @wence-
  • Add coverage selecting len from a dataframe (number of rows) (#16005) @wence-
  • Add basic tests of dataframe scan (#16003) @wence-
  • Add coverage for both expression and dataframe filter (#16002) @wence-
  • Remove deprecated ExtContext node (#16001) @wence-
  • Fix typo bug in gather implementation (#16000) @wence-
  • Extend coverage of groupby and rolling window nodes (#15999) @wence-
  • Coverage of binops where one or both operands are a scalar (#15998) @wence-
  • Add full coverage for whole-frame Agg expressions (#15997) @wence-
  • Add tests covering magic methods of Expr objects (#15996) @wence-
  • Add full coverage of utility functions (#15995) @wence-
  • Test behaviour of containers (#15994) @wence-
  • Fix implementation of any, all, and is_between (#15993) @wence-
  • Raise early on unhandled PythonScan node (#15992) @wence-
  • Remove mapfunction nodes that don't exist/aren't supported (#15991) @wence-
  • Add test coverage for slicing with "out of bounds" negative indices (#15990) @wence-
  • Standardize and type Series.dt methods (#15987) @mroeschke
  • Refactor distinct with hashset-based algorithms (#15984) @srinivasyadav18
  • resolve dependency-file-generator warning, remove unnecessary rapids-build-backend configuration (#15980) @jameslamb
  • Project automation bug fixes (#15971) @jarmak-nv
  • Add typing to single_column_frame (#15965) @mroeschke
  • Move some misc Frame methods to appropriate locations (#15963) @mroeschke
  • Condense pylibcudf data fixtures (#15958) @lithomas1
  • Refactor fillna logic to push specifics toward Frame subclasses and Column subclasses (#15957) @mroeschke
  • Remove unused parsing utilities (#15955) @vuule
  • Remove Scalar container type from polars interpreter (#15953) @wence-
  • Support arbitrary CUDA versions in UDF code (#15950) @bdice
  • Support large strings in cudf::io::text::multibyte_split (#15947) @davidwendt
  • Add external issue label and project automation (#15945) @jarmak-nv
  • Enable round-tripping of large strings in cudf (#15944) @galipremsagar
  • Add more complete type annotations in polars interpreter (#15942) @wence-
  • Update implementations to build with the latest cuco (#15938) @PointKernel
  • Support timezone aware pandas inputs in cudf (#15935) @mroeschke
  • Define Column.nan_as_null to return self (#15923) @mroeschke
  • Make Frame._dtype an iterator instead of a dict (#15920) @mroeschke
  • Port start of datetime.hpp to pylibcudf (#15916) @wence-
  • Introduce NamedColumn concept in cudf-polars (#15914) @wence-
  • Avoid redefining Frame._get_columns_by_label in subclasses (#15912) @mroeschke
  • Templatization of fixed-width parquet decoding kernels. (#15911) @nvdbaranec
  • New Decimal <--> Floating conversion (#15905) @pmattione-nvidia
  • Use Arrow C Data Interface functions for Python interop (#15904) @vyasr
  • Use offsetalator in cudf::io::json::detail::parse_string (#15900) @davidwendt
  • Rename strings multiple target replace API (#15898) @davidwendt
  • Apply clang-tidy autofixes (#15894) @vyasr
  • Update Python labels and remove unnecessary ones (#15893) @vyasr
  • Clean up pylibcudf test assertions (#15892) @lithomas1
  • Use offsetalator in orc rowgroup_char_counts_kernel (#15891) @davidwendt
  • Ensure literals have correct dtype (#15890) @wence-
  • Add overflow check when converting large strings to lists columns (#15887) @davidwendt
  • Use offsetalator in nvtext::tokenize_with_vocabulary (#15878) @davidwendt
  • Update interleave lists column for large strings (#15877) @davidwendt
  • Simple NumPy 2 fixes that are clearly no behavior change (#15876) @seberg
  • Support arrow:schema in Parquet writer to faithfully roundtrip duration types with Arrow (#15875) @mhaseeb123
  • Refactor join benchmarks to target public APIs with the default stream (#15873) @PointKernel
  • Fix url-decode benchmark to use offsetalator (#15871) @davidwendt
  • Use offsetalator in strings shift functor (#15870) @davidwendt
  • Memory Profiling (#15866) @madsbk
  • Expose stream parameter to public rolling APIs (#15865) @srinivasyadav18
  • Make Frame.astype return Self instead of a ColumnAccessor (#15861) @mroeschke
  • Use ColumnAccessor row and column length attributes more consistently (#15857) @mroeschke
  • add unit test setup for cudf_kafka (#15853) @jameslamb
  • Remove internal usage of core.index.as_index in favor of cudf.Index (#15851) @mroeschke
  • Ensure cudf.Series(cudf.Series(...)) creates a reference to the same index (#15845) @mroeschke
  • Remove benchmark-specific use of pinned-pooled memory in Parquet multithreaded benchmark. (#15838) @nvdbaranec
  • Implement on_bad_lines in json reader (#15834) @galipremsagar
  • Make Column.to_pandas return Index instead of Series (#15833) @mroeschke
  • Add test of interoperability of cuDF and arrow BYTE_STREAM_SPLIT encoders (#15832) @etseidl
  • Refactor Parquet writer options and builders (#15831) @etseidl
  • Migrate reshape.pxd to pylibcudf (#15827) @lithomas1
  • Remove legacy JSON reader and concurrent_unordered_map.cuh. (#15813) @bdice
  • Switch cuIO benchmarks to use pinned-pool host allocations by default. (#15805) @nvdbaranec
  • Change thrust::count_if call to raw kernel in strings split APIs (#15762) @davidwendt
  • Improve performance for long strings for nvtext::replace_tokens (#15756) @davidwendt
  • Implement chunked parquet reader in cudf-python (#15728) @galipremsagar
  • Add from_arrow_host functions for cudf interop with nanoarrow (#15645) @zeroshade
  • Add ability to enable rmm pool on cudf.pandas import (#15628) @galipremsagar
  • Executor for polars logical plans (#15504) @wence-
  • Implement dayname and monthname to match pandas (#15479) @btepera
  • Utilities for decimal <--> floating conversion (#15359) @pmattione-nvidia
  • For powers of 10, replace ipow with switch (#15353) @pmattione-nvidia
  • Use rapids-build-backend. (#15245) @vyasr
  • Add codecov coverage for pandas_tests (#14513) @galipremsagar

- C++
Published by raydouglass over 1 year ago

https://github.com/rapidsai/cudf - v24.08.00

🚨 Breaking Changes

  • Align Index init APIs with pandas 2.x (#16362) @mroeschke
  • Align Series APIs with pandas 2.x (#16333) @mroeschke
  • Add missing stream param to dictionary factory APIs (#16319) @JayjeetAtGithub
  • Deprecate dtype= parameter in reduction methods (#16313) @mroeschke
  • Remove squeeze argument from groupby (#16312) @mroeschke
  • Align more DataFrame APIs with pandas (#16310) @mroeschke
  • Remove mr param from write_csv and write_json (#16231) @JayjeetAtGithub
  • Report number of rows per file read by PQ reader when no row selection and fix segfault in chunked PQ reader when skip_rows > 0 (#16195) @mhaseeb123
  • Refactor from_arrow_device/host to use resource_ref (#16160) @harrism
  • Deprecate Arrow support in I/O (#16132) @lithomas1
  • Return FrozenList for Index.names (#16047) @galipremsagar
  • Add compile option to enable large strings support (#16037) @davidwendt
  • Hide visibility of non public symbols (#15982) @robertmaynard
  • Rename strings multiple target replace API (#15898) @davidwendt
  • Pinned vector factory that uses the global pool (#15895) @vuule
  • Apply clang-tidy autofixes (#15894) @vyasr
  • Support arrow:schema in Parquet writer to faithfully roundtrip duration types with Arrow (#15875) @mhaseeb123
  • Expose stream parameter to public rolling APIs (#15865) @srinivasyadav18
  • Fix large strings handling in nvtext::character_tokenize (#15829) @davidwendt
  • Remove legacy JSON reader and concurrent_unordered_map.cuh. (#15813) @bdice

🐛 Bug Fixes

  • Add flatbuffers to libcudf build (#16446) @galipremsagar
  • Fix parquet_field_list read_func lambda capture invalid this pointer (#16440) @davidwendt
  • Enable prefetching in cudf.pandas.install() (#16439) @bdice
  • Enable prefetching before runpy (#16427) @galipremsagar
  • Support thread-safe for prefetch_config::get and prefetch_config::set (#16425) @ttnghia
  • Fix a pandas-2.0 missing attribute error (#16416) @galipremsagar
  • [Bug] Remove loud NativeFile deprecation noise for read_parquet from S3 (#16415) @rjzamora
  • Fix nightly memcheck error for empty STREAM_INTEROP_TEST (#16406) @davidwendt
  • Gate ArrowStringArrayNumpySemantics cudf.pandas proxy behind version check (#16401) @mroeschke
  • Don't export bs_thread_pool (#16398) @KyleFromNVIDIA
  • Require fixed width types for casting in cudf-polars (#16381) @brandon-b-miller
  • Fix docstring of DataFrame.apply (#16351) @galipremsagar
  • Make __bool__ raise for more cudf objects (#16311) @mroeschke
  • Rename .devcontainers for CUDA 12.5 (#16293) @jakirkham
  • Fix split_record for all empty strings column (#16291) @davidwendt
  • Fix logic in to_arrow for empty list column (#16279) @wence-
  • [BUG] Make name attr of Index fast slow attrs (#16270) @Matt711
  • Add custom name setter and getter for proxy objects in cudf.pandas (#16234) @Matt711
  • Fall back when casting a timestamp to numeric in cudf-polars (#16232) @brandon-b-miller
  • Disable large string support for Java build (#16216) @jlowe
  • Remove CCCL patch for PR 211. (#16207) @bdice
  • Add single offset to an empty ListArray in cudf::to_arrow (#16201) @davidwendt
  • Fix memory_usage when calculating nested list column (#16193) @mroeschke
  • Support at/iat indexers in cudf.pandas (#16177) @mroeschke
  • Fix unused-return-value debug build error in from_arrow_stream_test.cpp (#16168) @davidwendt
  • Fix cudf::strings::replace_multiple hang on empty target (#16167) @davidwendt
  • Refactor from_arrow_device/host to use resource_ref (#16160) @harrism
  • interpolate returns new column if no values are interpolated (#16158) @mroeschke
  • Use provided memory resource for allocating mixed join results. (#16153) @bdice
  • Run DFG after verify-alpha-spec (#16151) @KyleFromNVIDIA
  • Use size_t to allow large conditional joins (#16127) @bdice
  • Allow only scale=0 fixed-point values in fixed_width_column_wrapper (#16120) @davidwendt
  • Fix pylibcudf Table.num_rows for 0 columns case and add interop to docs (#16108) @lithomas1
  • Add support for proxy np.flatiter objects (#16107) @Matt711
  • Ensure cudf objects can astype to any type when empty (#16106) @mroeschke
  • Support pd.read_pickle and pd.to_pickle in cudf.pandas (#16105) @Matt711
  • Fix unnecessarily strict check in parquet chunked reader for choosing split locations. (#16099) @nvdbaranec
  • Fix is_monotonic_* APIs to include nan's (#16085) @galipremsagar
  • More safely parse CUDA versions when subprocess output is contaminated (#16067) @brandon-b-miller
  • fast_slow_proxy: Don't import assert_eq at top-level (#16063) @wence-
  • Prevent bad ColumnAccessor state after .sort_index(axis=1, ignore_index=True) (#16061) @mroeschke
  • Fix ArrowDeviceArray interface to pass address of event (#16058) @zeroshade
  • Fix a size overflow bug in hash groupby (#16053) @PointKernel
  • Fix atomic_ref scope when multiple blocks are updating the same output (#16051) @vuule
  • Fix initialization error in to_arrow for empty string views (#16033) @wence-
  • Fix the int32 overflow when computing page fragment sizes for large string columns (#16028) @mhaseeb123
  • Fix the pool size alignment issue (#16024) @PointKernel
  • Improve multibyte-split byte-range performance (#16019) @davidwendt
  • Fix target counting in strings char-parallel replace (#16017) @davidwendt
  • Support IntervalDtype in cudf.from_pandas (#16014) @mroeschke
  • Fix memory size in create_byte_range_infos_consecutive (#16012) @davidwendt
  • Hide visibility of non public symbols (#15982) @robertmaynard
  • Fix Cython typo preventing proper inheritance (#15978) @vyasr
  • Fix convert_dtypes with convert_integer=False/convert_floating=True (#15964) @mroeschke
  • Fix nunique for MultiIndex, DataFrame, and all NA case with dropna=False (#15962) @mroeschke
  • Explicitly build for all GPU architectures (#15959) @vyasr
  • Preserve column type and class information in more DataFrame operations (#15949) @mroeschke
  • Add __array_interface__ to cudf.pandas numpy.ndarray proxy (#15936) @mroeschke
  • Allow tests to be built when stream util is disabled (#15933) @robertmaynard
  • Fix JSON multi-source reading when total source size exceeds INT_MAX bytes (#15930) @shrshi
  • Fix dask_cudf.read_parquet regression for legacy timestamp data (#15929) @rjzamora
  • Fix offsetalator when accessing over 268 million rows (#15921) @davidwendt
  • Fix debug assert in rowgroup_char_counts_kernel (#15902) @davidwendt
  • Fix categorical conversion from chunked arrow arrays (#15886) @vyasr
  • Handling for NaN and inf when converting floating point to fixed point types (#15885) @ttnghia
  • Manual merge of Branch 24.08 from 24.06 (#15869) @galipremsagar
  • Avoid unnecessary Index cast in IndexedFrame.index setter (#15843) @charlesbluca
  • Fix large strings handling in nvtext::character_tokenize (#15829) @davidwendt
  • Fix multi-replace target count logic for large strings (#15807) @davidwendt
  • Fix JSON parsing memory corruption - Fix Mixed types nested children removal (#15798) @karthikeyann
  • Allow anonymous user in devcontainer name. (#15784) @bdice
  • Add support for additional metaclasses of proxies and use for ExcelWriter (#15399) @vyasr

📖 Documentation

  • Add docstring for from_dataframe (#16260) @mroeschke
  • Update libcudf compiler requirements in contributing doc (#16103) @davidwendt
  • Add libcudf public/detail API pattern to developer guide (#16086) @davidwendt
  • Explain line profiler and how to know which functions are GPU-accelerated. (#16079) @bdice
  • cudf.pandas documentation improvement (#15948) @Matt711
  • Reland "Fix docs for IO readers and strings_convert (#15872)" (#15941) @lithomas1
  • Document how to use cudf.pandas in tandem with multiprocessing (#15940) @wence-
  • DOC: Add documentation for cudf.pandas in the Developer Guide (#15889) @Matt711
  • Improve options docs (#15888) @bdice
  • DOC: add linkcode to docs (#15860) @raybellwaves
  • DOC: use intersphinx mapping in pandas-compat ext (#15846) @raybellwaves
  • Fix inconsistent usage of 'results' and 'records' in read-json.md (#15766) @dagardner-nv
  • Update PandasCompat.py to resolve references (#15704) @raybellwaves

🚀 New Features

  • Warn on cuDF failure when POLARS_VERBOSE is true (#16308) @brandon-b-miller
  • Add drop_nulls in cudf-polars (#16290) @brandon-b-miller
  • [JNI] Add setKernelPinnedCopyThreshold and setPinnedAllocationThreshold (#16288) @abellina
  • Implement support for scan_ndjson in cudf-polars (#16263) @lithomas1
  • Publish cudf-polars nightlies (#16213) @lithomas1
  • Modify make_host_vector and make_device_uvector factories to optionally use pinned memory and kernel copy (#16206) @vuule
  • Migrate lists/set_operations to pylibcudf (#16190) @Matt711
  • Migrate lists/filling to pylibcudf (#16189) @Matt711
  • Fall back to CPU for unsupported libcudf binaryops in cudf-polars (#16188) @brandon-b-miller
  • Use resource_ref for upstream in stream_checking_resource_adaptor (#16187) @harrism
  • Migrate lists/modifying to pylibcudf (#16185) @Matt711
  • Migrate lists/filtering to pylibcudf (#16184) @Matt711
  • Migrate lists/sorting to pylibcudf (#16179) @Matt711
  • Add missing methods to lists/list_column_view.pxd in pylibcudf (#16175) @Matt711
  • Migrate pylibcudf lists gathering (#16170) @Matt711
  • Move kernel vis over to CUDF_HIDDEN (#16165) @robertmaynard
  • Add groupby_max multi-threaded benchmark (#16154) @srinivasyadav18
  • Promote has_nested_columns to cudf public API (#16131) @robertmaynard
  • Promote IO support queries to cudf API (#16125) @robertmaynard
  • cudf::merge public API now supports passing a user stream (#16124) @robertmaynard
  • Add TPC-H inspired examples for Libcudf (#16088) @JayjeetAtGithub
  • Installed cudf headers use cudf::allocate_like (#16087) @robertmaynard
  • cudf-polars string slicing (#16082) @brandon-b-miller
  • Migrate Parquet reader to pylibcudf (#16078) @lithomas1
  • Migrate lists/count_elements to pylibcudf (#16072) @Matt711
  • Migrate lists/extract to pylibcudf (#16071) @Matt711
  • Move common string utilities to public api (#16070) @robertmaynard
  • stable_distinct public api now has a stream parameter (#16068) @robertmaynard
  • Migrate expressions to pylibcudf (#16056) @lithomas1
  • Add support to ArrowDataSource in SourceInfo (#16050) @lithomas1
  • Experimental support for configurable prefetching (#16020) @vyasr
  • Migrate CSV reader to pylibcudf (#16011) @lithomas1
  • Migrate string slice APIs to pylibcudf (#15988) @brandon-b-miller
  • Migrate lists/contains to pylibcudf (#15981) @Matt711
  • Remove CCCL 2.2 patches as we now always use 2.5+ (#15969) @robertmaynard
  • Migrate JSON reader to pylibcudf (#15966) @lithomas1
  • Add a developer check for proxy objects (#15956) @Matt711
  • Start migrating I/O writers to pylibcudf (starting with JSON) (#15952) @lithomas1
  • Kernel copy for pinned memory (#15934) @vuule
  • Migrate left join and conditional join benchmarks to use nvbench (#15931) @srinivasyadav18
  • Migrate lists/combine to pylibcudf (#15928) @Matt711
  • Plumb pylibcudf strings contains_re through cudf_polars (#15918) @brandon-b-miller
  • Start migrating I/O to pylibcudf (#15899) @lithomas1
  • Pinned vector factory that uses the global pool (#15895) @vuule
  • Migrate strings contains operations to pylibcudf (#15880) @brandon-b-miller
  • Migrate quantile.pxd to pylibcudf (#15874) @lithomas1
  • Migrate round to pylibcudf (#15863) @lithomas1
  • Migrate string replace.pxd to pylibcudf (#15839) @lithomas1
  • Add an Environment Variable for debugging the fast path in cudf.pandas (#15837) @Matt711
  • Add an option to run cuIO benchmarks with pinned buffers as input (#15830) @vuule
  • Update pylibcudf testing utilities (#15772) @brandon-b-miller
  • Migrate string capitalize APIs to pylibcudf (#15503) @brandon-b-miller
  • Add tests for pylibcudf binaryops (#15470) @brandon-b-miller
  • Migrate column factories to pylibcudf (#15257) @brandon-b-miller
  • cuDF/libcudf exponentially weighted moving averages (#9027) @brandon-b-miller

🛠️ Improvements

  • Ensure objects with interface are converted to cupy/numpy arrays (#16436) @mroeschke
  • Add about rmm modes in cudf.pandas docs (#16404) @galipremsagar
  • Gracefully CUDF_FAIL when `skip_rows > 0` in Chunked Parquet reader (#16385) @mhaseeb123
  • Make C++ compilation warning free after #16297 (#16379) @wence-
  • Align Index init APIs with pandas 2.x (#16362) @mroeschke
  • Use rapids_cpm_bs_thread_pool() (#16360) @KyleFromNVIDIA
  • Rename PrefetchConfig to prefetch_config. (#16358) @bdice
  • Implement parquet reading using pylibcudf in cudf-polars (#16346) @lithomas1
  • Fix compile warnings with jni_utils.hpp (#16336) @ttnghia
  • Align Series APIs with pandas 2.x (#16333) @mroeschke
  • Add missing stream param to dictionary factory APIs (#16319) @JayjeetAtGithub
  • Mark cudf._typing as a typing module in ruff (#16318) @mroeschke
  • Add stream param to list explode APIs (#16317) @JayjeetAtGithub
  • Fix polars for 1.2.1 (#16316) @lithomas1
  • Use workflow branch 24.08 again (#16314) @KyleFromNVIDIA
  • Deprecate dtype= parameter in reduction methods (#16313) @mroeschke
  • Remove squeeze argument from groupby (#16312) @mroeschke
  • Align more DataFrame APIs with pandas (#16310) @mroeschke
  • Clean unneeded/redundant dtype utils (#16309) @mroeschke
  • Implement read_csv in cudf-polars using pylibcudf (#16307) @lithomas1
  • Use Column.can_cast_safely instead of some ad-hoc dtype functions in .where (#16303) @mroeschke
  • Drop {{ pin_compatible('numpy', max_pin='x') }} (#16301) @jakirkham
  • Host implementation of to_arrow using nanoarrow (#16297) @zeroshade
  • Add ability to prefetch in cudf.pandas and change default to managed pool (#16296) @galipremsagar
  • Fix tests for polars 1.2 (#16292) @lithomas1
  • Introduce dedicated options for low memory readers (#16289) @galipremsagar
  • Remove decimal/floating 64/128bit switches due to register pressure (#16287) @pmattione-nvidia
  • Make ColumnAccessor strictly require a mapping of columns (#16285) @mroeschke
  • Introduce version file so we can conditionally handle things in tests (#16280) @wence-
  • Type & reduce cupy usage (#16277) @mroeschke
  • Update cudf::detail::grid_1d to use thread_index_type (#16276) @davidwendt
  • Replace np.isscalar/issubdtype checks with is_scalar/.kind checks (#16275) @mroeschke
  • Remove xml from sort_ninja_log.py utility (#16274) @davidwendt
  • Fix issue in horizontal concat implementation in cudf-polars (#16271) @wence-
  • Preserve order in left join for cudf-polars (#16268) @wence-
  • Replace is_datetime/timedelta_dtype checks with .kind checks (#16262) @mroeschke
  • Replace is_float/integer_dtype checks with .kind checks (#16261) @mroeschke
  • Build and test with CUDA 12.5.1 (#16259) @KyleFromNVIDIA
  • Replace is_bool_type with checking .dtype.kind (#16255) @mroeschke
  • remove cuco_noexcept.diff (#16254) @trxcllnt
  • Update contains_tests.cpp to use public cudf::slice (#16253) @davidwendt
  • Improve the test data for pylibcudf I/O tests (#16247) @lithomas1
  • Short circuit some Column methods (#16246) @mroeschke
  • Make nvcomp adapter compatible with new version macros (#16245) @vuule
  • Add Column.strftime/strptime instead of overloading as_string/datetime/timedelta_column (#16243) @mroeschke
  • Remove temporary functor overloads required by cuco version bump (#16242) @PointKernel
  • Remove hash_character_ngrams dependency from jaccard_index (#16241) @davidwendt
  • Expose sorted groupby parameters to pylibcudf (#16240) @wence-
  • Expose reflection to check if casting between two types is supported (#16239) @wence-
  • Handle nans in groupby-aggregations in polars executor (#16233) @wence-
  • Remove mr param from write_csv and write_json (#16231) @JayjeetAtGithub
  • Support Literals in groupby-agg (#16218) @wence-
  • Handle csv reader options in cudf-polars (#16211) @wence-
  • Update vendored thread_pool implementation (#16210) @wence-
  • Add low memory JSON reader for cudf.pandas (#16204) @galipremsagar
  • Clean up state variables in MultiIndex (#16203) @mroeschke
  • skip CMake 3.30.0 (#16202) @jameslamb
  • Assert valid metadata is passed in to_arrow for list_view (#16198) @wence-
  • Expose type traits to pylibcudf (#16197) @wence-
  • Report number of rows per file read by PQ reader when no row selection and fix segfault in chunked PQ reader when skip_rows > 0 (#16195) @mhaseeb123
  • Cast count aggs to correct dtype in translation (#16192) @wence-
  • Some small fixes in cudf-polars (#16191) @wence-
  • split up CUDA-suffixed dependencies in dependencies.yaml (#16183) @jameslamb
  • Define PTDS for the stream hook libs (#16182) @trxcllnt
  • Make test_python_cudf_pandas generate requirements.txt (#16181) @trxcllnt
  • Add environment-agnostic ci/run_cudf_polars_pytest.sh (#16178) @trxcllnt
  • Implement translation for some unary functions and a single datetime extraction (#16173) @wence-
  • Remove size constraints on source files in batched JSON reading (#16162) @shrshi
  • CI: Build wheels for cudf-polars (#16156) @lithomas1
  • Update cudf-polars for v1 release of polars (#16149) @wence-
  • Use strings concatenate to support large strings in CSV writer (#16148) @davidwendt
  • Use verify-alpha-spec hook (#16144) @KyleFromNVIDIA
  • Adds write-coalescing code path optimization to FST (#16143) @elstehle
  • MAINT: Adapt to NumPy 2 promotion changes (#16141) @seberg
  • API: Check for integer overflows when creating scalar from python int (#16140) @seberg
  • Remove the (unused) implementation of host_parse_nested_json (#16135) @vuule
  • Deprecate Arrow support in I/O (#16132) @lithomas1
  • Disable dict support for split-page kernel in the parquet reader. (#16128) @nvdbaranec
  • Add throughput metrics for REDUCTION_BENCH/REDUCTION_NVBENCH benchmarks (#16126) @jihoonson
  • Add ensure_index to not unnecessarily shallow copy cudf.Index (#16117) @mroeschke
  • Make binary operators work between fixed-point and floating args (#16116) @pmattione-nvidia
  • Implement Ternary copy_if_else (#16114) @wence-
  • Implement handlers for series literal in cudf-polars (#16113) @wence-
  • Fix dtype errors in StringArrays (#16111) @galipremsagar
  • Ensure MultiIndex.to_frame deep copies columns (#16110) @mroeschke
  • Parallelize gpuInitStringDescriptors for fixed length byte array data (#16109) @mhaseeb123
  • Finish implementation of cudf-polars boolean function handlers (#16098) @wence-
  • Expose and then implement support for cross joins in cudf-polars (#16097) @wence-
  • Defer copying in Column.astype(copy=True) (#16095) @mroeschke
  • Fix segfault in conditional join (#16094) @bdice
  • Free temp memory no longer needed in multibyte_split processing (#16091) @davidwendt
  • Rename gather/scatter benchmarks to clarify coalesced behavior. (#16083) @bdice
  • Adapt to polars upstream changes and turn on CI testing (#16081) @wence-
  • Reduce/clean copy usage in Series, reshaping (#16080) @mroeschke
  • Account for FIXED_LEN_BYTE_ARRAY when calculating fragment sizes in Parquet writer (#16064) @etseidl
  • Reduce (shallow) copies in DataFrame ops (#16060) @mroeschke
  • Add multi-file support to dask_cudf.read_json (#16057) @rjzamora
  • Reduce deep copies in Index ops (#16054) @mroeschke
  • Implement chunked column wise concat in chunked parquet reader (#16052) @galipremsagar
  • Add exception when trying to create large strings with cudf::test::strings_column_wrapper (#16049) @davidwendt
  • Return FrozenList for Index.names (#16047) @galipremsagar
  • Add ast cast test (#16045) @pmattione-nvidia
  • Remove override_dtypes and include_index from Frame._copy_type_metadata (#16043) @mroeschke
  • Add ruff rules to avoid importing from typing (#16040) @mroeschke
  • Fix decimal -> float cast in ast code (#16038) @pmattione-nvidia
  • Add compile option to enable large strings support (#16037) @davidwendt
  • Reduce conditional_join nvbench configurations (#16036) @srinivasyadav18
  • Project automation update: skip if not in project (#16035) @jarmak-nv
  • Add stream parameter to cudf::io::text::multibyte_split (#16034) @davidwendt
  • Delete unused code from stringfunction evaluator (#16032) @wence-
  • Fix exclude regex in pre-commit clang-format hook (#16030) @wence-
  • Refactor rmm usage in cudf.pandas (#16021) @galipremsagar
  • Enable ruff TCH: typing imports under if TYPE_CHECKING (#16015) @mroeschke
  • Restrict the allowed pandas timezone objects in cudf (#16013) @mroeschke
  • orc multithreaded benchmark (#16009) @zpuller
  • Add tests of expression-based sort and sort-by (#16008) @wence-
  • Add tests of implemented StringFunctions (#16007) @wence-
  • Add test that diagonal concat with mismatching schemas raises (#16006) @wence-
  • Add coverage selecting len from a dataframe (number of rows) (#16005) @wence-
  • Add basic tests of dataframe scan (#16003) @wence-
  • Add coverage for both expression and dataframe filter (#16002) @wence-
  • Remove deprecated ExtContext node (#16001) @wence-
  • Fix typo bug in gather implementation (#16000) @wence-
  • Extend coverage of groupby and rolling window nodes (#15999) @wence-
  • Coverage of binops where one or both operands are a scalar (#15998) @wence-
  • Add full coverage for whole-frame Agg expressions (#15997) @wence-
  • Add tests covering magic methods of Expr objects (#15996) @wence-
  • Add full coverage of utility functions (#15995) @wence-
  • Test behaviour of containers (#15994) @wence-
  • Fix implementation of any, all, and isbetween (#15993) @wence-
  • Raise early on unhandled PythonScan node (#15992) @wence-
  • Remove mapfunction nodes that don't exist/aren't supported (#15991) @wence-
  • Add test coverage for slicing with "out of bounds" negative indices (#15990) @wence-
  • Standardize and type Series.dt methods (#15987) @mroeschke
  • Refactor distinct with hashset-based algorithms (#15984) @srinivasyadav18
  • resolve dependency-file-generator warning, remove unnecessary rapids-build-backend configuration (#15980) @jameslamb
  • Project automation bug fixes (#15971) @jarmak-nv
  • Add typing to single_column_frame (#15965) @mroeschke
  • Move some misc Frame methods to appropriate locations (#15963) @mroeschke
  • Condense pylibcudf data fixtures (#15958) @lithomas1
  • Refactor fillna logic to push specifics toward Frame subclasses and Column subclasses (#15957) @mroeschke
  • Remove unused parsing utilities (#15955) @vuule
  • Remove Scalar container type from polars interpreter (#15953) @wence-
  • Support arbitrary CUDA versions in UDF code (#15950) @bdice
  • Support large strings in cudf::io::text::multibyte_split (#15947) @davidwendt
  • Add external issue label and project automation (#15945) @jarmak-nv
  • Enable round-tripping of large strings in cudf (#15944) @galipremsagar
  • Add more complete type annotations in polars interpreter (#15942) @wence-
  • Update implementations to build with the latest cuco (#15938) @PointKernel
  • Support timezone aware pandas inputs in cudf (#15935) @mroeschke
  • Define Column.nan_as_null to return self (#15923) @mroeschke
  • Make Frame._dtype an iterator instead of a dict (#15920) @mroeschke
  • Port start of datetime.hpp to pylibcudf (#15916) @wence-
  • Introduce NamedColumn concept in cudf-polars (#15914) @wence-
  • Avoid redefining Frame._get_columns_by_label in subclasses (#15912) @mroeschke
  • Templatization of fixed-width parquet decoding kernels. (#15911) @nvdbaranec
  • New Decimal <--> Floating conversion (#15905) @pmattione-nvidia
  • Use Arrow C Data Interface functions for Python interop (#15904) @vyasr
  • Use offsetalator in cudf::io::json::detail::parse_string (#15900) @davidwendt
  • Rename strings multiple target replace API (#15898) @davidwendt
  • Apply clang-tidy autofixes (#15894) @vyasr
  • Update Python labels and remove unnecessary ones (#15893) @vyasr
  • Clean up pylibcudf test assertions (#15892) @lithomas1
  • Use offsetalator in orc rowgroup_char_counts_kernel (#15891) @davidwendt
  • Ensure literals have correct dtype (#15890) @wence-
  • Add overflow check when converting large strings to lists columns (#15887) @davidwendt
  • Use offsetalator in nvtext::tokenize_with_vocabulary (#15878) @davidwendt
  • Update interleave lists column for large strings (#15877) @davidwendt
  • Simple NumPy 2 fixes that are clearly no behavior change (#15876) @seberg
  • Support arrow:schema in Parquet writer to faithfully roundtrip duration types with Arrow (#15875) @mhaseeb123
  • Refactor join benchmarks to target public APIs with the default stream (#15873) @PointKernel
  • Fix url-decode benchmark to use offsetalator (#15871) @davidwendt
  • Use offsetalator in strings shift functor (#15870) @davidwendt
  • Memory Profiling (#15866) @madsbk
  • Expose stream parameter to public rolling APIs (#15865) @srinivasyadav18
  • Make Frame.astype return Self instead of a ColumnAccessor (#15861) @mroeschke
  • Use ColumnAccessor row and column length attributes more consistently (#15857) @mroeschke
  • add unit test setup for cudf_kafka (#15853) @jameslamb
  • Remove internal usage of core.index.as_index in favor of cudf.Index (#15851) @mroeschke
  • Ensure cudf.Series(cudf.Series(...)) creates a reference to the same index (#15845) @mroeschke
  • Remove benchmark-specific use of pinned-pooled memory in Parquet multithreaded benchmark. (#15838) @nvdbaranec
  • Implement on_bad_lines in json reader (#15834) @galipremsagar
  • Make Column.to_pandas return Index instead of Series (#15833) @mroeschke
  • Add test of interoperability of cuDF and arrow BYTE_STREAM_SPLIT encoders (#15832) @etseidl
  • Refactor Parquet writer options and builders (#15831) @etseidl
  • Migrate reshape.pxd to pylibcudf (#15827) @lithomas1
  • Remove legacy JSON reader and concurrent_unordered_map.cuh. (#15813) @bdice
  • Switch cuIO benchmarks to use pinned-pool host allocations by default. (#15805) @nvdbaranec
  • Change thrust::count_if call to raw kernel in strings split APIs (#15762) @davidwendt
  • Improve performance for long strings for nvtext::replace_tokens (#15756) @davidwendt
  • Implement chunked parquet reader in cudf-python (#15728) @galipremsagar
  • Add from_arrow_host functions for cudf interop with nanoarrow (#15645) @zeroshade
  • Add ability to enable rmm pool on cudf.pandas import (#15628) @galipremsagar
  • Executor for polars logical plans (#15504) @wence-
  • Implement dayname and monthname to match pandas (#15479) @btepera
  • Utilities for decimal <--> floating conversion (#15359) @pmattione-nvidia
  • For powers of 10, replace ipow with switch (#15353) @pmattione-nvidia
  • Use rapids-build-backend. (#15245) @vyasr
  • Add codecov coverage for pandas_tests (#14513) @galipremsagar

- C++
Published by raydouglass over 1 year ago

https://github.com/rapidsai/cudf - v24.06.01

🚨 Breaking Changes

  • Deprecate Groupby.collect (#15808) @galipremsagar
  • Raise FileNotFoundError when a literal JSON string that looks like a json filename is passed (#15806) @lithomas1
  • Support filtered I/O in chunked_parquet_reader and simplify the use of parquet_reader_options (#15764) @mhaseeb123
  • Raise errors for unsupported operations on certain types (#15712) @galipremsagar
  • Support DurationType in cudf parquet reader via arrow:schema (#15617) @mhaseeb123
  • Remove protobuf and use parsed ORC statistics from libcudf (#15564) @bdice
  • Remove legacy JSON reader from Python (#15538) @bdice
  • Removing all batching code from parquet writer (#15528) @mhaseeb123
  • Convert libcudf resource parameters to rmm::device_async_resource_ref (#15507) @harrism
  • Remove deprecated strings offsets_begin (#15454) @davidwendt
  • Floating <--> fixed-point conversion must now be called explicitly (#15438) @pmattione-nvidia
  • Bind read_parquet_metadata API to libcudf instead of pyarrow and extract RowGroup information (#15398) @mhaseeb123
  • Remove deprecated hash() and spark_murmurhash3_x86_32() (#15375) @davidwendt
  • Remove empty elements from exploded character-ngrams output (#15371) @davidwendt
  • [FEA] Performance improvement for mixed left semi/anti join (#15288) @tgujar
  • Align date_range defaults with pandas, support tz (#15139) @mroeschke

πŸ› Bug Fixes

  • Backport: Use size_t to allow large conditional joins (#16127) (#16133) @bdice
  • Backport #16045 to 24.06 (#16102) @vyasr
  • Backport #16038 to 24.06 (#16101) @vyasr
  • Backport: Fix segfault in conditional join (#16094) (#16100) @bdice
  • Add patch for incorrect cuco noexcept clauses (#16077) @vyasr
  • Revert "Fix docs for IO readers and strings_convert" (#15872) @vyasr
  • Remove problematic call of index setter to unblock dask-cuda CI (#15844) @charlesbluca
  • Use rapids_cpm_nvtx3 to get same nvtx3 target state as rmm (#15840) @robertmaynard
  • Return boolean from config_host_memory_resource instead of throwing (#15815) @abellina
  • Add temporary dask-cudf workaround for categorical sorting (#15801) @rjzamora
  • Fix row group alignment in ORC writer (#15789) @vuule
  • Raise error when sorting by categorical column in dask-cudf (#15788) @rjzamora
  • Upgrade arrow to 16.1 (#15787) @galipremsagar
  • Add support for PandasArray for pandas<2.1.0 (#15786) @galipremsagar
  • Limit runtime dependency to libarrow>=16.0.0,<16.1.0a0 (#15782) @pentschev
  • Fix cat.as_ordered not propagating correct size (#15780) @mroeschke
  • Handle mixed-like homogeneous types in isin (#15771) @galipremsagar
  • Fix id_vars and value_vars not accepting string scalars in melt (#15765) @mroeschke
  • Fix DatetimeIndex.loc for all types of ordering cases (#15761) @galipremsagar
  • Fix arrow versioning logic (#15755) @vyasr
  • Avoid running sanitizer on Java test designed to cause an error (#15753) @jlowe
  • Handle empty dataframe object with index present in setitem of loc (#15752) @galipremsagar
  • Eliminate circular reference in DataFrame/Series.iloc/loc (#15749) @mroeschke
  • Cap the absolute row index per pass in parquet chunked reader. (#15735) @nvdbaranec
  • Fix Index.repeat for datetime64 types (#15722) @galipremsagar
  • Fix multibyte check for case convert for large strings (#15721) @davidwendt
  • Fix get_loc to properly fetch results from an index that is in decreasing order (#15719) @galipremsagar
  • Return same type as the original index for .loc operations (#15717) @galipremsagar
  • Correct static builds + static arrow (#15715) @robertmaynard
  • Raise errors for unsupported operations on certain types (#15712) @galipremsagar
  • Fix ColumnAccessor caching of nrows if empty previously (#15710) @mroeschke
  • Allow None when nan_as_null=False in column constructor (#15709) @galipremsagar
  • Refine CudaTest.testCudaException in case throwing wrong type of CudaError under aarch64 (#15706) @sperlingxx
  • Fix maxima of categorical column (#15701) @rjzamora
  • Add proxy for inplace operations in cudf.pandas (#15695) @galipremsagar
  • Make nan_as_null behavior consistent across all APIs (#15692) @galipremsagar
  • Fix CI s3 api command to fetch latest results (#15687) @galipremsagar
  • Add NumpyExtensionArray proxy type in cudf.pandas (#15686) @galipremsagar
  • Properly implement binaryops for proxy types (#15684) @galipremsagar
  • Fix copy assignment and the comparison operator of rmm_host_allocator (#15677) @vuule
  • Fix multi-source reading in JSON byte range reader (#15671) @shrshi
  • Return int64 when pandas compatible mode is turned on for get_indexer (#15659) @galipremsagar
  • Fix Index contains for error validations and float vs int comparisons (#15657) @galipremsagar
  • Preserve sub-second data for time scalars in column construction (#15655) @galipremsagar
  • Check row limit size in cudf::strings::join_strings (#15643) @davidwendt
  • Enable sorting on column with nulls using query-planning (#15639) @rjzamora
  • Fix operator precedence problem in Parquet reader (#15638) @etseidl
  • Fix decoding of dictionary encoded FIXED_LEN_BYTE_ARRAY data in Parquet reader (#15601) @etseidl
  • Fix debug warnings/errors in from_arrow_device_test.cpp (#15596) @davidwendt
  • Add "collect" aggregation support to dask-cudf (#15593) @rjzamora
  • Fix categorical-accessor support and testing in dask-cudf (#15591) @rjzamora
  • Disable compute-sanitizer usage in CI tests with CUDA<11.6 (#15584) @davidwendt
  • Preserve RangeIndex.step in to_arrow/from_arrow (#15581) @mroeschke
  • Ignore new cupy warning (#15574) @vyasr
  • Add cuda-sanitizer-api dependency for test-cpp matrix 11.4 (#15573) @davidwendt
  • Allow apply udf to reference global modules in cudf.pandas (#15569) @mroeschke
  • Fix deprecation warnings for json legacy reader (#15563) @davidwendt
  • Fix millisecond resampling in cudf Python (#15560) @mroeschke
  • Rename JSON_READER_OPTION to JSON_READER_OPTION_NVBENCH. (#15553) @bdice
  • Fix a JNI bug in JSON parsing fixup (#15550) @revans2
  • Remove conda channel setup from wheel CI image script. (#15539) @bdice
  • cudf.pandas: Series dt accessor is CombinedDatetimelikeProperties (#15523) @wence-
  • Fix for some compiler warnings in parquet/page_decode.cuh (#15518) @etseidl
  • Fix exponent overflow in strings-to-double conversion (#15517) @davidwendt
  • nanoarrow uses package override for proper pinned versions generation (#15515) @robertmaynard
  • Remove index name overrides in dask-cudf pyarrow table dispatch (#15514) @charlesbluca
  • Fix async synchronization issues in json_column.cu (#15497) @karthikeyann
  • Add new patch to hide more CCCL APIs (#15493) @vyasr
  • Make improvements in pandas-test reporting (#15485) @galipremsagar
  • Fixed page data truncation in parquet writer under certain conditions. (#15474) @nvdbaranec
  • Only use data_type constructor with scale for decimal types (#15472) @wence-
  • Avoid "p2p" shuffle as a default when dask_cudf is imported (#15469) @rjzamora
  • Fix debug build errors from to_arrow_device_test.cpp (#15463) @davidwendt
  • Fix base_normalator::integer_sizeof_fn integer dispatch (#15457) @davidwendt
  • Allow consumers of static builds to find nanoarrow (#15456) @robertmaynard
  • Allow jit compilation when using a splayed CUDA toolkit (#15451) @robertmaynard
  • Handle case of scan aggregation in groupby-transform (#15450) @wence-
  • Test static builds in CI and fix nanoarrow configure (#15437) @vyasr
  • Fixes potential race in JSON parser when parsing JSON lines format and when recovering from invalid lines (#15419) @elstehle
  • Fix errors in chunked ORC writer when no tables were (successfully) written (#15393) @vuule
  • Support implicit array conversion with query-planning enabled (#15378) @rjzamora
  • Fix arrow-based round trip of empty dataframes (#15373) @wence-
  • Remove empty elements from exploded character-ngrams output (#15371) @davidwendt
  • Remove boundscheck=False setting in cython files (#15362) @wence-
  • Patch dask-expr var logic in dask-cudf (#15347) @rjzamora
  • Fix for logical and syntactical errors in libcudf c++ examples (#15346) @mhaseeb123
  • Disable dask-expr in docs builds. (#15343) @bdice
  • Apply the cuFile error work around to data_sink as well (#15335) @vuule
  • Fix parquet predicate filtering with column projection (#15113) @karthikeyann
  • Check column type equality, handling nested types correctly. (#14531) @bdice

📖 Documentation

  • Fix docs for IO readers and strings_convert (#15842) @bdice
  • Update cudf.pandas docs for GA (#15744) @beckernick
  • Add contributing warning about circular imports (#15691) @er-eis
  • Update libcudf developer guide for strings offsets column (#15661) @davidwendt
  • Update developer guide with device_async_resource_ref guidelines (#15562) @harrism
  • DOC: add pandas intersphinx mapping (#15531) @raybellwaves
  • rm-dup-doc in frame.py (#15530) @raybellwaves
  • Update CONTRIBUTING.md to use latest cuda env (#15467) @raybellwaves
  • Doc: interleave columns pandas compat (#15383) @raybellwaves
  • Simplified README Examples (#15338) @wkaisertexas
  • Add debug tips section to libcudf developer guide (#15329) @davidwendt
  • Fix and clarify notes on result ordering (#13255) @shwina

🚀 New Features

  • Add JNI bindings for zstd compression of NVCOMP. (#15729) @firestarman
  • Fix spaces around CSV quoted strings (#15727) @thabetx
  • Add default pinned pool that falls back to new pinned allocations (#15665) @vuule
  • Overhaul ops-codeowners coverage (#15660) @raydouglass
  • Concatenate dictionary of objects along axis=1 (#15623) @er-eis
  • Construct pylibcudf columns from objects supporting __cuda_array_interface__ (#15615) @brandon-b-miller
  • Expose some Parquet per-column configuration options via the python API (#15613) @etseidl
  • Migrate string find operations to pylibcudf (#15604) @brandon-b-miller
  • Round trip FIXED_LEN_BYTE_ARRAY data properly in Parquet writer (#15600) @etseidl
  • Reading multi-line JSON in string columns using runtime configurable delimiter (#15556) @shrshi
  • Remove public gtest dependency from libcudf conda package (#15534) @robertmaynard
  • Fea/move to latest nanoarrow (#15526) @robertmaynard
  • Migrate string case operations to pylibcudf (#15489) @brandon-b-miller
  • Add Parquet encoding statistics to column chunk metadata (#15452) @etseidl
  • Implement JNI for chunked ORC reader (#15446) @ttnghia
  • Add some missing optional fields to the Parquet RowGroup metadata (#15421) @etseidl
  • Adding parquet transcoding example (#15420) @mhaseeb123
  • Add fields to Parquet Statistics structure that were added in parquet-format 2.10 (#15412) @etseidl
  • Add option to Parquet writer to skip compressing individual columns (#15411) @etseidl
  • Add BYTE_STREAM_SPLIT support to Parquet (#15311) @etseidl
  • Introduce benchmark suite for JSON reader options (#15124) @shrshi
  • Implement ORC chunked reader (#15094) @ttnghia
  • Extend cudf devcontainers to specify jitify2 kernel cache (#15068) @robertmaynard
  • Add to_arrow_device function to cudf interop using nanoarrow (#15047) @zeroshade
  • Add JSON option to prune columns (#14996) @karthikeyann

🛠️ Improvements

  • Deprecate Groupby.collect (#15808) @galipremsagar
  • Raise FileNotFoundError when a literal JSON string that looks like a json filename is passed (#15806) @lithomas1
  • Deprecate divisions='quantile' support in set_index (#15804) @rjzamora
  • Improve performance of Series.to_numpy/to_cupy (#15792) @mroeschke
  • Access self.index instead of self._index where possible (#15781) @mroeschke
  • Support filtered I/O in chunked_parquet_reader and simplify the use of parquet_reader_options (#15764) @mhaseeb123
  • Avoid index-to-column conversion in some DataFrame ops (#15763) @mroeschke
  • Fix chunked_parquet_reader behavior when input has no more rows to read (#15757) @mhaseeb123
  • [JNI] Expose java API for cudf::io::config_host_memory_resource (#15745) @abellina
  • Migrate all cpp pxd files into pylibcudf (#15740) @vyasr
  • Validate and materialize iterators earlier in as_column (#15739) @mroeschke
  • Push some as_column arrow logic to ColumnBase.from_arrow (#15738) @mroeschke
  • Expose stream parameter in public reduction APIs (#15737) @srinivasyadav18
  • remove unnecessary 'setuptools' host dependency, simplify dependencies.yaml (#15736) @jameslamb
  • Defer to C++ equality and hashing for pylibcudf DataType and Aggregation objects (#15732) @wence-
  • Implement null-aware NOT_EQUALS binop (#15731) @wence-
  • Fix split-record result list column offset type (#15707) @davidwendt
  • Upgrade arrow to 16 (#15703) @galipremsagar
  • Remove experimental namespace from make_strings_children (#15702) @davidwendt
  • Rework get_json_object benchmark to use nvbench (#15698) @davidwendt
  • Rework some python tests of Parquet delta encodings (#15693) @etseidl
  • Skeleton cudf polars package (#15688) @wence-
  • Upgrade pre commit hooks (#15685) @wence-
  • Allow fillna to validate for CategoricalColumn.fillna (#15683) @galipremsagar
  • Misc Column cleanups (#15682) @mroeschke
  • Reducing runtime of JSON reader options benchmark (#15681) @shrshi
  • Add Timestamp and Timedelta proxy types (#15680) @galipremsagar
  • Remove host_parse_nested_json. (#15674) @bdice
  • Reduce runtime for ParquetChunkedReaderInputLimitTest gtests (#15672) @davidwendt
  • Add large-strings gtest for cudf::interleave_columns (#15669) @davidwendt
  • Use experimental make_strings_children for multi-replace_re (#15667) @davidwendt
  • Enabled Holiday types in cudf.pandas (#15664) @galipremsagar
  • Remove obsolete XFAIL markers for query-planning (#15662) @rjzamora
  • Clean up join benchmarks (#15644) @PointKernel
  • Enable warnings as errors in custreamz (#15642) @mroeschke
  • Improve distinct join with set retrieve (#15636) @PointKernel
  • Fix -Werror=type-limits. (#15635) @bdice
  • Enable FutureWarnings/DeprecationWarnings as errors for dask_cudf (#15634) @mroeschke
  • Remove NVBench SHA override. (#15633) @alliepiper
  • Add support for large string columns to Parquet reader and writer (#15632) @etseidl
  • Large strings support in MD5 and SHA hashers (#15631) @davidwendt
  • Fix make_offsets_child_column usage in cudf::strings::detail::shift (#15630) @davidwendt
  • Use experimental make_strings_children for strings convert (#15629) @davidwendt
  • Forward-merge branch-24.04 to branch-24.06 (#15627) @bdice
  • Avoid accessing attributes via _column if not needed (#15624) @mroeschke
  • Make ColumnBase.__cuda_array_interface__ opt out instead of opt in (#15622) @mroeschke
  • Large strings support for cudf::gather (#15621) @davidwendt
  • Remove jni-docker-build workflow (#15619) @bdice
  • Support DurationType in cudf parquet reader via arrow:schema (#15617) @mhaseeb123
  • Drop Centos7 support (#15608) @NvTimLiu
  • Use experimental make_strings_children for json/csv writers (#15599) @davidwendt
  • Use experimental make_strings_children for strings join/url_encode/slice (#15598) @davidwendt
  • Use experimental make_strings_children in nvtext APIs (#15595) @davidwendt
  • Migrate to {{ stdlib("c") }} (#15594) @hcho3
  • Deprecate to/from_dask_dataframe APIs in dask-cudf (#15592) @rjzamora
  • Minor fixups for future NumPy 2 compatibility (#15590) @seberg
  • Delay materializing RangeIndex in .reset_index (#15588) @mroeschke
  • Use experimental make_strings_children for capitalize/case/pad functions (#15587) @davidwendt
  • Use experimental make_strings_children for strings replace/filter/translate (#15586) @davidwendt
  • Add multithreaded parquet reader benchmarks. (#15585) @nvdbaranec
  • Don't materialize column during RangeIndex methods (#15582) @mroeschke
  • Improve performance for cudf::strings::count_re (#15578) @davidwendt
  • Replace RangeIndex.start/stop/_step with _range (#15576) @mroeschke
  • add --rm and --name to devcontainer run args (#15572) @trxcllnt
  • Change the default dictionary policy in Parquet writer from ALWAYS to ADAPTIVE (#15570) @mhaseeb123
  • Rename experimental JSON tests. (#15568) @bdice
  • Refactor JNI native dependency loading to allow returning of library path (#15566) @jlowe
  • Remove protobuf and use parsed ORC statistics from libcudf (#15564) @bdice
  • Deprecate legacy JSON reader options. (#15558) @bdice
  • Use same .clang-format in cuDF JNI (#15557) @bdice
  • Large strings support for cudf::fill (#15555) @davidwendt
  • Upgrade upper bound pinning to pandas-2.2.2 (#15554) @galipremsagar
  • Work around issues with cccl main (#15552) @miscco
  • Enable pandas plotting unit tests for cudf.pandas (#15547) @mroeschke
  • Move timezone conversion logic to DatetimeColumn (#15545) @mroeschke
  • Large strings support for cudf::interleave_columns (#15544) @davidwendt
  • [skip ci] Switch back to 24.06 branch for pandas tests (#15543) @galipremsagar
  • Remove checks dependency from static-configure test job. (#15542) @bdice
  • Remove legacy JSON reader from Python (#15538) @bdice
  • Enable more ignored pandas unit tests for cudf.pandas (#15535) @mroeschke
  • Large strings support for cudf::clamp (#15533) @davidwendt
  • Remove version hard-coding (#15529) @galipremsagar
  • Removing all batching code from parquet writer (#15528) @mhaseeb123
  • Make some private class properties not settable (#15527) @mroeschke
  • Large strings support in regex replace APIs (#15524) @davidwendt
  • Skip pandas unit tests that crash pytest workers in cudf.pandas (#15521) @mroeschke
  • Preserve column metadata during more DataFrame operations (#15519) @mroeschke
  • Move to pandas-tests to a dedicated workflow file and trigger it from branch.yaml (#15516) @galipremsagar
  • Large strings gtest fixture and utilities (#15513) @davidwendt
  • Convert libcudf resource parameters to rmm::device_async_resource_ref (#15507) @harrism
  • Relax protobuf lower bound to 3.20. (#15506) @bdice
  • Clean up index methods (#15496) @mroeschke
  • Update strings contains benchmarks to nvbench (#15495) @davidwendt
  • Update NVBench fixture to use new hooks, fix pinned memory segfault. (#15492) @alliepiper
  • Enable tests/scalar and test/series in cudf.pandas tests (#15486) @mroeschke
  • Clean up __cuda_array_interface__ handling in as_column (#15477) @mroeschke
  • Avoid .ordered and .categories from being settable in CategoricalColumn and CategoricalDtype (#15475) @mroeschke
  • Ignore pandas tests for cudf.pandas that need motoserver (#15468) @mroeschke
  • Use cached_property for NumericColumn.nan_count instead of ._nan_count variable (#15466) @mroeschke
  • Add to_arrow_device() functions that accept views (#15465) @davidwendt
  • Add custom status check workflow (#15464) @galipremsagar
  • Disable pandas 2.x clipboard tests in cudf.pandas tests (#15462) @mroeschke
  • Enable tests/strings/test_api.py and tests/io/pytables in cudf.pandas tests (#15461) @mroeschke
  • Enable test_parsing in cudf.pandas tests (#15460) @mroeschke
  • Add from_arrow_device function to cudf interop using nanoarrow (#15458) @zeroshade
  • Remove deprecated strings offsets_begin (#15454) @davidwendt
  • Enable tests/windows/ in cudf.pandas tests (#15444) @mroeschke
  • Enable tests/interchange/test_impl.py in cudf.pandas tests (#15443) @mroeschke
  • Enable tests/io/test_user_agent.py in cudf pandas tests (#15442) @mroeschke
  • Performance improvement in libcudf case conversion for long strings (#15441) @davidwendt
  • Remove prior test skipping in run-pandas-tests with testing 2.2.1 (#15440) @mroeschke
  • Support orc and text IO with dask-expr using legacy conversion (#15439) @rjzamora
  • Floating <--> fixed-point conversion must now be called explicitly (#15438) @pmattione-nvidia
  • Unify Copy-On-Write and Spilling (#15436) @madsbk
  • Enable dask_cudf json and s3 tests with query-planning on (#15408) @rjzamora
  • Bump ruff and codespell pre-commit checks (#15407) @mroeschke
  • Enable all tests for arm arch (#15402) @galipremsagar
  • Bind read_parquet_metadata API to libcudf instead of pyarrow and extract RowGroup information (#15398) @mhaseeb123
  • Optimizing multi-source byte range reading in JSON reader (#15396) @shrshi
  • add correct labels to pandas_function_request.md (#15381) @raybellwaves
  • Remove deprecated hash() and spark_murmurhash3_x86_32() (#15375) @davidwendt
  • Large strings support in cudf::merge (#15374) @davidwendt
  • Enable test-reporting for pandas pytests in CI (#15369) @galipremsagar
  • Use logical types in Parquet reader (#15365) @etseidl
  • Add experimental make_strings_children utility (#15363) @davidwendt
  • Forward-merge branch-24.04 to branch-24.06 (#15349) @bdice
  • Fix CMake files in libcudf C++ examples to use existing libcudf build if present (#15348) @mhaseeb123
  • Use ruff pydocstyle over pydocstyle pre-commit hook (#15345) @mroeschke
  • Refactor stream mode setup for gtests (#15337) @davidwendt
  • Benchmark decimal <--> floating conversions. (#15334) @pmattione-nvidia
  • Avoid duplicate dask-cudf testing (#15333) @rjzamora
  • Skip decode steps in Parquet reader when nullable columns have no nulls (#15332) @etseidl
  • Update udf_cpp to use rapids_cpm_cccl. (#15331) @bdice
  • Forward-merge branch-24.04 into branch-24.06 skip ci @rapids-bot[bot]
  • Allow numeric_only=True for simple groupby reductions (#15326) @rjzamora
  • Drop CentOS 7 support. (#15323) @bdice
  • Rework cudf::find_and_replace_all to use gather-based make_strings_column (#15305) @davidwendt
  • First pass at adding testing for pylibcudf (#15300) @vyasr
  • [FEA] Performance improvement for mixed left semi/anti join (#15288) @tgujar
  • Rework cudf::replace_nulls to use strings::detail::copy_if_else (#15286) @davidwendt
  • Clean up special casing in as_column for non-typed input (#15276) @mroeschke
  • Large strings support in cudf::concatenate (#15195) @davidwendt
  • Use less is_categorical_dtype (#15148) @mroeschke
  • Align date_range defaults with pandas, support tz (#15139) @mroeschke
  • ModuleAccelerator performance: cache the result of checking if a caller is in the denylist (#15056) @shwina
  • Use offsetalator in cudf::strings::replace functions (#14824) @davidwendt
  • Cleanup some timedelta/datetime column logic (#14715) @mroeschke
  • Refactor numpy array input in as_column (#14651) @mroeschke
  • Refactor joins for conditional semis and antis (#14646) @DanialJavady96
  • Eagerly populate the class dict for cudf.pandas proxy types (#14534) @shwina
  • Some additional kernel thread index refactoring. (#14107) @bdice

- C++
Published by raydouglass over 1 year ago

https://github.com/rapidsai/cudf - v24.06.00

🚨 Breaking Changes

  • Deprecate Groupby.collect (#15808) @galipremsagar
  • Raise FileNotFoundError when a literal JSON string that looks like a json filename is passed (#15806) @lithomas1
  • Support filtered I/O in chunked_parquet_reader and simplify the use of parquet_reader_options (#15764) @mhaseeb123
  • Raise errors for unsupported operations on certain types (#15712) @galipremsagar
  • Support DurationType in cudf parquet reader via arrow:schema (#15617) @mhaseeb123
  • Remove protobuf and use parsed ORC statistics from libcudf (#15564) @bdice
  • Remove legacy JSON reader from Python (#15538) @bdice
  • Removing all batching code from parquet writer (#15528) @mhaseeb123
  • Convert libcudf resource parameters to rmm::device_async_resource_ref (#15507) @harrism
  • Remove deprecated strings offsets_begin (#15454) @davidwendt
  • Floating <--> fixed-point conversion must now be called explicitly (#15438) @pmattione-nvidia
  • Bind read_parquet_metadata API to libcudf instead of pyarrow and extract RowGroup information (#15398) @mhaseeb123
  • Remove deprecated hash() and spark_murmurhash3_x86_32() (#15375) @davidwendt
  • Remove empty elements from exploded character-ngrams output (#15371) @davidwendt
  • [FEA] Performance improvement for mixed left semi/anti join (#15288) @tgujar
  • Align date_range defaults with pandas, support tz (#15139) @mroeschke

πŸ› Bug Fixes

  • Revert "Fix docs for IO readers and strings_convert" (#15872) @vyasr
  • Remove problematic call of index setter to unblock dask-cuda CI (#15844) @charlesbluca
  • Use rapids_cpm_nvtx3 to get same nvtx3 target state as rmm (#15840) @robertmaynard
  • Return boolean from config_host_memory_resource instead of throwing (#15815) @abellina
  • Add temporary dask-cudf workaround for categorical sorting (#15801) @rjzamora
  • Fix row group alignment in ORC writer (#15789) @vuule
  • Raise error when sorting by categorical column in dask-cudf (#15788) @rjzamora
  • Upgrade arrow to 16.1 (#15787) @galipremsagar
  • Add support for PandasArray for pandas<2.1.0 (#15786) @galipremsagar
  • Limit runtime dependency to libarrow>=16.0.0,<16.1.0a0 (#15782) @pentschev
  • Fix cat.as_ordered not propagating correct size (#15780) @mroeschke
  • Handle mixed-like homogeneous types in isin (#15771) @galipremsagar
  • Fix id_vars and value_vars not accepting string scalars in melt (#15765) @mroeschke
  • Fix DatetimeIndex.loc for all types of ordering cases (#15761) @galipremsagar
  • Fix arrow versioning logic (#15755) @vyasr
  • Avoid running sanitizer on Java test designed to cause an error (#15753) @jlowe
  • Handle empty dataframe object with index present in setitem of loc (#15752) @galipremsagar
  • Eliminate circular reference in DataFrame/Series.iloc/loc (#15749) @mroeschke
  • Cap the absolute row index per pass in parquet chunked reader. (#15735) @nvdbaranec
  • Fix Index.repeat for datetime64 types (#15722) @galipremsagar
  • Fix multibyte check for case convert for large strings (#15721) @davidwendt
  • Fix get_loc to properly fetch results from an index that is in decreasing order (#15719) @galipremsagar
  • Return same type as the original index for .loc operations (#15717) @galipremsagar
  • Correct static builds + static arrow (#15715) @robertmaynard
  • Raise errors for unsupported operations on certain types (#15712) @galipremsagar
  • Fix ColumnAccessor caching of nrows if empty previously (#15710) @mroeschke
  • Allow None when nan_as_null=False in column constructor (#15709) @galipremsagar
  • Refine CudaTest.testCudaException in case throwing wrong type of CudaError under aarch64 (#15706) @sperlingxx
  • Fix maxima of categorical column (#15701) @rjzamora
  • Add proxy for inplace operations in cudf.pandas (#15695) @galipremsagar
  • Make nan_as_null behavior consistent across all APIs (#15692) @galipremsagar
  • Fix CI s3 api command to fetch latest results (#15687) @galipremsagar
  • Add NumpyExtensionArray proxy type in cudf.pandas (#15686) @galipremsagar
  • Properly implement binaryops for proxy types (#15684) @galipremsagar
  • Fix copy assignment and the comparison operator of rmm_host_allocator (#15677) @vuule
  • Fix multi-source reading in JSON byte range reader (#15671) @shrshi
  • Return int64 when pandas compatible mode is turned on for get_indexer (#15659) @galipremsagar
  • Fix Index contains for error validations and float vs int comparisons (#15657) @galipremsagar
  • Preserve sub-second data for time scalars in column construction (#15655) @galipremsagar
  • Check row limit size in cudf::strings::join_strings (#15643) @davidwendt
  • Enable sorting on column with nulls using query-planning (#15639) @rjzamora
  • Fix operator precedence problem in Parquet reader (#15638) @etseidl
  • Fix decoding of dictionary encoded FIXED_LEN_BYTE_ARRAY data in Parquet reader (#15601) @etseidl
  • Fix debug warnings/errors in from_arrow_device_test.cpp (#15596) @davidwendt
  • Add "collect" aggregation support to dask-cudf (#15593) @rjzamora
  • Fix categorical-accessor support and testing in dask-cudf (#15591) @rjzamora
  • Disable compute-sanitizer usage in CI tests with CUDA<11.6 (#15584) @davidwendt
  • Preserve RangeIndex.step in to_arrow/from_arrow (#15581) @mroeschke
  • Ignore new cupy warning (#15574) @vyasr
  • Add cuda-sanitizer-api dependency for test-cpp matrix 11.4 (#15573) @davidwendt
  • Allow apply udf to reference global modules in cudf.pandas (#15569) @mroeschke
  • Fix deprecation warnings for json legacy reader (#15563) @davidwendt
  • Fix millisecond resampling in cudf Python (#15560) @mroeschke
  • Rename JSON_READER_OPTION to JSON_READER_OPTION_NVBENCH. (#15553) @bdice
  • Fix a JNI bug in JSON parsing fixup (#15550) @revans2
  • Remove conda channel setup from wheel CI image script. (#15539) @bdice
  • cudf.pandas: Series dt accessor is CombinedDatetimelikeProperties (#15523) @wence-
  • Fix for some compiler warnings in parquet/page_decode.cuh (#15518) @etseidl
  • Fix exponent overflow in strings-to-double conversion (#15517) @davidwendt
  • nanoarrow uses package override for proper pinned versions generation (#15515) @robertmaynard
  • Remove index name overrides in dask-cudf pyarrow table dispatch (#15514) @charlesbluca
  • Fix async synchronization issues in json_column.cu (#15497) @karthikeyann
  • Add new patch to hide more CCCL APIs (#15493) @vyasr
  • Make improvements in pandas-test reporting (#15485) @galipremsagar
  • Fixed page data truncation in parquet writer under certain conditions. (#15474) @nvdbaranec
  • Only use data_type constructor with scale for decimal types (#15472) @wence-
  • Avoid "p2p" shuffle as a default when dask_cudf is imported (#15469) @rjzamora
  • Fix debug build errors from to_arrow_device_test.cpp (#15463) @davidwendt
  • Fix base_normalator::integer_sizeof_fn integer dispatch (#15457) @davidwendt
  • Allow consumers of static builds to find nanoarrow (#15456) @robertmaynard
  • Allow jit compilation when using a splayed CUDA toolkit (#15451) @robertmaynard
  • Handle case of scan aggregation in groupby-transform (#15450) @wence-
  • Test static builds in CI and fix nanoarrow configure (#15437) @vyasr
  • Fixes potential race in JSON parser when parsing JSON lines format and when recovering from invalid lines (#15419) @elstehle
  • Fix errors in chunked ORC writer when no tables were (successfully) written (#15393) @vuule
  • Support implicit array conversion with query-planning enabled (#15378) @rjzamora
  • Fix arrow-based round trip of empty dataframes (#15373) @wence-
  • Remove empty elements from exploded character-ngrams output (#15371) @davidwendt
  • Remove boundscheck=False setting in cython files (#15362) @wence-
  • Patch dask-expr var logic in dask-cudf (#15347) @rjzamora
  • Fix for logical and syntactical errors in libcudf c++ examples (#15346) @mhaseeb123
  • Disable dask-expr in docs builds. (#15343) @bdice
  • Apply the cuFile error work around to data_sink as well (#15335) @vuule
  • Fix parquet predicate filtering with column projection (#15113) @karthikeyann
  • Check column type equality, handling nested types correctly. (#14531) @bdice

📖 Documentation

  • Fix docs for IO readers and strings_convert (#15842) @bdice
  • Update cudf.pandas docs for GA (#15744) @beckernick
  • Add contributing warning about circular imports (#15691) @er-eis
  • Update libcudf developer guide for strings offsets column (#15661) @davidwendt
  • Update developer guide with device_async_resource_ref guidelines (#15562) @harrism
  • DOC: add pandas intersphinx mapping (#15531) @raybellwaves
  • rm-dup-doc in frame.py (#15530) @raybellwaves
  • Update CONTRIBUTING.md to use latest cuda env (#15467) @raybellwaves
  • Doc: interleave columns pandas compat (#15383) @raybellwaves
  • Simplified README Examples (#15338) @wkaisertexas
  • Add debug tips section to libcudf developer guide (#15329) @davidwendt
  • Fix and clarify notes on result ordering (#13255) @shwina

🚀 New Features

  • Add JNI bindings for zstd compression of NVCOMP. (#15729) @firestarman
  • Fix spaces around CSV quoted strings (#15727) @thabetx
  • Add default pinned pool that falls back to new pinned allocations (#15665) @vuule
  • Overhaul ops-codeowners coverage (#15660) @raydouglass
  • Concatenate dictionary of objects along axis=1 (#15623) @er-eis
  • Construct pylibcudf columns from objects supporting __cuda_array_interface__ (#15615) @brandon-b-miller
  • Expose some Parquet per-column configuration options via the python API (#15613) @etseidl
  • Migrate string find operations to pylibcudf (#15604) @brandon-b-miller
  • Round trip FIXED_LEN_BYTE_ARRAY data properly in Parquet writer (#15600) @etseidl
  • Reading multi-line JSON in string columns using runtime configurable delimiter (#15556) @shrshi
  • Remove public gtest dependency from libcudf conda package (#15534) @robertmaynard
  • Fea/move to latest nanoarrow (#15526) @robertmaynard
  • Migrate string case operations to pylibcudf (#15489) @brandon-b-miller
  • Add Parquet encoding statistics to column chunk metadata (#15452) @etseidl
  • Implement JNI for chunked ORC reader (#15446) @ttnghia
  • Add some missing optional fields to the Parquet RowGroup metadata (#15421) @etseidl
  • Adding parquet transcoding example (#15420) @mhaseeb123
  • Add fields to Parquet Statistics structure that were added in parquet-format 2.10 (#15412) @etseidl
  • Add option to Parquet writer to skip compressing individual columns (#15411) @etseidl
  • Add BYTE_STREAM_SPLIT support to Parquet (#15311) @etseidl
  • Introduce benchmark suite for JSON reader options (#15124) @shrshi
  • Implement ORC chunked reader (#15094) @ttnghia
  • Extend cudf devcontainers to specify jitify2 kernel cache (#15068) @robertmaynard
  • Add to_arrow_device function to cudf interop using nanoarrow (#15047) @zeroshade
  • Add JSON option to prune columns (#14996) @karthikeyann

πŸ› οΈ Improvements

  • Deprecate Groupby.collect (#15808) @galipremsagar
  • Raise FileNotFoundError when a literal JSON string that looks like a json filename is passed (#15806) @lithomas1
  • Deprecate divisions='quantile' support in set_index (#15804) @rjzamora
  • Improve performance of Series.to_numpy/to_cupy (#15792) @mroeschke
  • Access self.index instead of self._index where possible (#15781) @mroeschke
  • Support filtered I/O in chunked_parquet_reader and simplify the use of parquet_reader_options (#15764) @mhaseeb123
  • Avoid index-to-column conversion in some DataFrame ops (#15763) @mroeschke
  • Fix chunked_parquet_reader behavior when input has no more rows to read (#15757) @mhaseeb123
  • [JNI] Expose java API for cudf::io::config_host_memory_resource (#15745) @abellina
  • Migrate all cpp pxd files into pylibcudf (#15740) @vyasr
  • Validate and materialize iterators earlier in as_column (#15739) @mroeschke
  • Push some as_column arrow logic to ColumnBase.from_arrow (#15738) @mroeschke
  • Expose stream parameter in public reduction APIs (#15737) @srinivasyadav18
  • remove unnecessary 'setuptools' host dependency, simplify dependencies.yaml (#15736) @jameslamb
  • Defer to C++ equality and hashing for pylibcudf DataType and Aggregation objects (#15732) @wence-
  • Implement null-aware NOT_EQUALS binop (#15731) @wence-
  • Fix split-record result list column offset type (#15707) @davidwendt
  • Upgrade arrow to 16 (#15703) @galipremsagar
  • Remove experimental namespace from make_strings_children (#15702) @davidwendt
  • Rework get_json_object benchmark to use nvbench (#15698) @davidwendt
  • Rework some python tests of Parquet delta encodings (#15693) @etseidl
  • Skeleton cudf polars package (#15688) @wence-
  • Upgrade pre commit hooks (#15685) @wence-
  • Allow fillna to validate for CategoricalColumn.fillna (#15683) @galipremsagar
  • Misc Column cleanups (#15682) @mroeschke
  • Reducing runtime of JSON reader options benchmark (#15681) @shrshi
  • Add Timestamp and Timedelta proxy types (#15680) @galipremsagar
  • Remove host_parse_nested_json. (#15674) @bdice
  • Reduce runtime for ParquetChunkedReaderInputLimitTest gtests (#15672) @davidwendt
  • Add large-strings gtest for cudf::interleave_columns (#15669) @davidwendt
  • Use experimental make_strings_children for multi-replace_re (#15667) @davidwendt
  • Enabled Holiday types in cudf.pandas (#15664) @galipremsagar
  • Remove obsolete XFAIL markers for query-planning (#15662) @rjzamora
  • Clean up join benchmarks (#15644) @PointKernel
  • Enable warnings as errors in custreamz (#15642) @mroeschke
  • Improve distinct join with set retrieve (#15636) @PointKernel
  • Fix -Werror=type-limits. (#15635) @bdice
  • Enable FutureWarnings/DeprecationWarnings as errors for dask_cudf (#15634) @mroeschke
  • Remove NVBench SHA override. (#15633) @alliepiper
  • Add support for large string columns to Parquet reader and writer (#15632) @etseidl
  • Large strings support in MD5 and SHA hashers (#15631) @davidwendt
  • Fix make_offsets_child_column usage in cudf::strings::detail::shift (#15630) @davidwendt
  • Use experimental make_strings_children for strings convert (#15629) @davidwendt
  • Forward-merge branch-24.04 to branch-24.06 (#15627) @bdice
  • Avoid accessing attributes via _column if not needed (#15624) @mroeschke
  • Make ColumnBase.__cuda_array_interface__ opt out instead of opt in (#15622) @mroeschke
  • Large strings support for cudf::gather (#15621) @davidwendt
  • Remove jni-docker-build workflow (#15619) @bdice
  • Support DurationType in cudf parquet reader via arrow:schema (#15617) @mhaseeb123
  • Drop Centos7 support (#15608) @NvTimLiu
  • Use experimental make_strings_children for json/csv writers (#15599) @davidwendt
  • Use experimental make_strings_children for strings join/url_encode/slice (#15598) @davidwendt
  • Use experimental make_strings_children in nvtext APIs (#15595) @davidwendt
  • Migrate to {{ stdlib("c") }} (#15594) @hcho3
  • Deprecate to/from_dask_dataframe APIs in dask-cudf (#15592) @rjzamora
  • Minor fixups for future NumPy 2 compatibility (#15590) @seberg
  • Delay materializing RangeIndex in .reset_index (#15588) @mroeschke
  • Use experimental make_strings_children for capitalize/case/pad functions (#15587) @davidwendt
  • Use experimental make_strings_children for strings replace/filter/translate (#15586) @davidwendt
  • Add multithreaded parquet reader benchmarks. (#15585) @nvdbaranec
  • Don't materialize column during RangeIndex methods (#15582) @mroeschke
  • Improve performance for cudf::strings::count_re (#15578) @davidwendt
  • Replace RangeIndex._start/_stop/_step with _range (#15576) @mroeschke
  • add --rm and --name to devcontainer run args (#15572) @trxcllnt
  • Change the default dictionary policy in Parquet writer from ALWAYS to ADAPTIVE (#15570) @mhaseeb123
  • Rename experimental JSON tests. (#15568) @bdice
  • Refactor JNI native dependency loading to allow returning of library path (#15566) @jlowe
  • Remove protobuf and use parsed ORC statistics from libcudf (#15564) @bdice
  • Deprecate legacy JSON reader options. (#15558) @bdice
  • Use same .clang-format in cuDF JNI (#15557) @bdice
  • Large strings support for cudf::fill (#15555) @davidwendt
  • Upgrade upper bound pinning to pandas-2.2.2 (#15554) @galipremsagar
  • Work around issues with cccl main (#15552) @miscco
  • Enable pandas plotting unit tests for cudf.pandas (#15547) @mroeschke
  • Move timezone conversion logic to DatetimeColumn (#15545) @mroeschke
  • Large strings support for cudf::interleave_columns (#15544) @davidwendt
  • [skip ci] Switch back to 24.06 branch for pandas tests (#15543) @galipremsagar
  • Remove checks dependency from static-configure test job. (#15542) @bdice
  • Remove legacy JSON reader from Python (#15538) @bdice
  • Enable more ignored pandas unit tests for cudf.pandas (#15535) @mroeschke
  • Large strings support for cudf::clamp (#15533) @davidwendt
  • Remove version hard-coding (#15529) @galipremsagar
  • Removing all batching code from parquet writer (#15528) @mhaseeb123
  • Make some private class properties not settable (#15527) @mroeschke
  • Large strings support in regex replace APIs (#15524) @davidwendt
  • Skip pandas unit tests that crash pytest workers in cudf.pandas (#15521) @mroeschke
  • Preserve column metadata during more DataFrame operations (#15519) @mroeschke
  • Move to pandas-tests to a dedicated workflow file and trigger it from branch.yaml (#15516) @galipremsagar
  • Large strings gtest fixture and utilities (#15513) @davidwendt
  • Convert libcudf resource parameters to rmm::device_async_resource_ref (#15507) @harrism
  • Relax protobuf lower bound to 3.20. (#15506) @bdice
  • Clean up index methods (#15496) @mroeschke
  • Update strings contains benchmarks to nvbench (#15495) @davidwendt
  • Update NVBench fixture to use new hooks, fix pinned memory segfault. (#15492) @alliepiper
  • Enable tests/scalar and test/series in cudf.pandas tests (#15486) @mroeschke
  • Clean up __cuda_array_interface__ handling in as_column (#15477) @mroeschke
  • Avoid .ordered and .categories from being settable in CategoricalColumn and CategoricalDtype (#15475) @mroeschke
  • Ignore pandas tests for cudf.pandas that need motoserver (#15468) @mroeschke
  • Use cached_property for NumericColumn.nan_count instead of ._nan_count variable (#15466) @mroeschke
  • Add to_arrow_device() functions that accept views (#15465) @davidwendt
  • Add custom status check workflow (#15464) @galipremsagar
  • Disable pandas 2.x clipboard tests in cudf.pandas tests (#15462) @mroeschke
  • Enable tests/strings/test_api.py and tests/io/pytables in cudf.pandas tests (#15461) @mroeschke
  • Enable test_parsing in cudf.pandas tests (#15460) @mroeschke
  • Add from_arrow_device function to cudf interop using nanoarrow (#15458) @zeroshade
  • Remove deprecated strings offsets_begin (#15454) @davidwendt
  • Enable tests/windows/ in cudf.pandas tests (#15444) @mroeschke
  • Enable tests/interchange/test_impl.py in cudf.pandas tests (#15443) @mroeschke
  • Enable tests/io/test_user_agent.py in cudf pandas tests (#15442) @mroeschke
  • Performance improvement in libcudf case conversion for long strings (#15441) @davidwendt
  • Remove prior test skipping in run-pandas-tests with testing 2.2.1 (#15440) @mroeschke
  • Support orc and text IO with dask-expr using legacy conversion (#15439) @rjzamora
  • Floating <--> fixed-point conversion must now be called explicitly (#15438) @pmattione-nvidia
  • Unify Copy-On-Write and Spilling (#15436) @madsbk
  • Enable dask_cudf json and s3 tests with query-planning on (#15408) @rjzamora
  • Bump ruff and codespell pre-commit checks (#15407) @mroeschke
  • Enable all tests for arm arch (#15402) @galipremsagar
  • Bind read_parquet_metadata API to libcudf instead of pyarrow and extract RowGroup information (#15398) @mhaseeb123
  • Optimizing multi-source byte range reading in JSON reader (#15396) @shrshi
  • add correct labels to pandas_function_request.md (#15381) @raybellwaves
  • Remove deprecated hash() and spark_murmurhash3_x86_32() (#15375) @davidwendt
  • Large strings support in cudf::merge (#15374) @davidwendt
  • Enable test-reporting for pandas pytests in CI (#15369) @galipremsagar
  • Use logical types in Parquet reader (#15365) @etseidl
  • Add experimental make_strings_children utility (#15363) @davidwendt
  • Forward-merge branch-24.04 to branch-24.06 (#15349) @bdice
  • Fix CMake files in libcudf C++ examples to use existing libcudf build if present (#15348) @mhaseeb123
  • Use ruff pydocstyle over pydocstyle pre-commit hook (#15345) @mroeschke
  • Refactor stream mode setup for gtests (#15337) @davidwendt
  • Benchmark decimal <--> floating conversions. (#15334) @pmattione-nvidia
  • Avoid duplicate dask-cudf testing (#15333) @rjzamora
  • Skip decode steps in Parquet reader when nullable columns have no nulls (#15332) @etseidl
  • Update udf_cpp to use rapids_cpm_cccl. (#15331) @bdice
  • Forward-merge branch-24.04 into branch-24.06 skip ci @rapids-bot[bot]
  • Allow numeric_only=True for simple groupby reductions (#15326) @rjzamora
  • Drop CentOS 7 support. (#15323) @bdice
  • Rework cudf::find_and_replace_all to use gather-based make_strings_column (#15305) @davidwendt
  • First pass at adding testing for pylibcudf (#15300) @vyasr
  • [FEA] Performance improvement for mixed left semi/anti join (#15288) @tgujar
  • Rework cudf::replace_nulls to use strings::detail::copy_if_else (#15286) @davidwendt
  • Clean up special casing in as_column for non-typed input (#15276) @mroeschke
  • Large strings support in cudf::concatenate (#15195) @davidwendt
  • Use less _is_categorical_dtype (#15148) @mroeschke
  • Align date_range defaults with pandas, support tz (#15139) @mroeschke
  • ModuleAccelerator performance: cache the result of checking if a caller is in the denylist (#15056) @shwina
  • Use offsetalator in cudf::strings::replace functions (#14824) @davidwendt
  • Cleanup some timedelta/datetime column logic (#14715) @mroeschke
  • Refactor numpy array input in as_column (#14651) @mroeschke
  • Refactor joins for conditional semis and antis (#14646) @DanialJavady96
  • Eagerly populate the class dict for cudf.pandas proxy types (#14534) @shwina
  • Some additional kernel thread index refactoring. (#14107) @bdice

- C++
Published by raydouglass over 1 year ago

https://github.com/rapidsai/cudf - v24.04.01

🚨 Breaking Changes

  • Restructure pylibcudf/arrow interop facilities (#15325) @vyasr
  • Change exceptions thrown by copying APIs (#15319) @vyasr
  • Change strings_column_view::char_size to return int64 (#15197) @davidwendt
  • Upgrade to arrow-14.0.2 (#15108) @galipremsagar
  • Add support for pandas-2.2 in cudf (#15100) @galipremsagar
  • Deprecate cudf::hashing::spark_murmurhash3_x86_32 (#15074) @davidwendt
  • Align MultiIndex.get_indexer with pandas 2.2 change (#15059) @mroeschke
  • Raise an error on import for unsupported GPUs. (#15053) @bdice
  • Deprecate datelike isin casting strings to dates to match pandas 2.2 (#15046) @mroeschke
  • Align concat Series name behavior in pandas 2.2 (#15032) @mroeschke
  • Add future_stack to DataFrame.stack (#15015) @galipremsagar
  • Deprecate groupby fillna (#15000) @mroeschke
  • Deprecate replace with categorical columns (#14988) @mroeschke
  • Deprecate delim_whitespace in read_csv for pandas 2.2 (#14986) @mroeschke
  • Deprecate parameters similar to pandas 2.2 (#14984) @mroeschke
  • Add missing atomic operators, refactor atomic operators, move atomic operators to detail namespace. (#14962) @bdice
  • Add pandas-2.x support in cudf (#14916) @galipremsagar
  • Use cuco::static_set in the hash-based groupby (#14813) @PointKernel

πŸ› Bug Fixes

  • Fix an issue with creating a series from scalar when dtype='category' (#15476) @galipremsagar
  • Update pre-commit-hooks to v0.0.3 (#15355) @KyleFromNVIDIA
  • [BUG][JNI] Trigger MemoryBuffer.onClosed after memory is freed (#15351) @abellina
  • Fix an issue with multiple short list rowgroups using the Parquet chunked reader. (#15342) @nvdbaranec
  • Avoid importing dask-expr if "query-planning" config is False (#15340) @rjzamora
  • Fix gtests/ERROR_TEST errors when run in Debug (#15317) @davidwendt
  • Fix OOB read in inflate_kernel (#15309) @vuule
  • Work around a cuFile error when running CSV tests with memcheck (#15293) @vuule
  • Fix Doxygen upload directory (#15291) @KyleFromNVIDIA
  • Fix Doxygen check (#15289) @KyleFromNVIDIA
  • Reintroduce PANDAS_GE_220 import (#15287) @wence-
  • Fix mean computation for the geometric distribution in the data generator (#15282) @vuule
  • Fix Parquet decimal64 stats (#15281) @etseidl
  • Make linking of nvtx3-cpp BUILD_LOCAL_INTERFACE (#15271) @KyleFromNVIDIA
  • Workaround compute-sanitizer memcheck bug (#15259) @davidwendt
  • Cleanup hostdevice_vector and add more APIs (#15252) @ttnghia
  • Fix number of rows in randomly generated lists columns (#15248) @vuule
  • Fix wrong output for collect_list/collect_set of lists column (#15243) @ttnghia
  • Fix testchunkedPackTwoPasses to copy from the bounce buffer (#15220) @abellina
  • Fix accessing .columns by an external API (#15212) @galipremsagar
  • [JNI] Disable testChunkedPackTwoPasses for now (#15210) @abellina
  • Update labeler and codeowner configs for CMake files (#15208) @PointKernel
  • Avoid dict normalization in __dask_tokenize__ (#15187) @rjzamora
  • Fix memcheck error in distinct inner join (#15164) @PointKernel
  • Remove unneeded script parameters in test_cpp_memcheck.sh (#15158) @davidwendt
  • Fix ListColumn.to_pandas() to retain list type (#15155) @galipremsagar
  • Avoid factorization in MultiIndex.to_pandas (#15150) @mroeschke
  • Fix GroupBy.get_group and GroupBy.indices (#15143) @wence-
  • Remove const from range_window_bounds::_extent. (#15138) @mythrocks
  • DataFrame.columns = ... retains RangeIndex & set dtype (#15129) @mroeschke
  • Correctly handle output for GroupBy.apply when chunk results are reindexed series (#15109) @brandon-b-miller
  • Fix Series.groupby.shift with a MultiIndex (#15098) @mroeschke
  • Fix reductions when DataFrame has MultiIndex columns (#15097) @mroeschke
  • Fix deprecation warnings for deprecated hash() calls (#15095) @davidwendt
  • Add support for arrow large_string in cudf (#15093) @galipremsagar
  • Fix sort_values pytest failure with pandas-2.x regression (#15092) @galipremsagar
  • Resolve path parsing issues in get_json_object (#15082) @SurajAralihalli
  • Fix bugs in handling of delta encodings (#15075) @etseidl
  • Fix is_device_write_preferred in void_sink and user_sink_wrapper (#15064) @vuule
  • Eliminate duplicate allocation of nested string columns (#15061) @vuule
  • Raise an error on import for unsupported GPUs. (#15053) @bdice
  • Align concat Series name behavior in pandas 2.2 (#15032) @mroeschke
  • Fix Index.difference to handle duplicate values when one of the inputs is empty (#15016) @galipremsagar
  • Add future_stack to DataFrame.stack (#15015) @galipremsagar
  • Fix handling of values=None in pylibcudf GroupBy.get_groups (#14998) @shwina
  • Fix DataFrame.sort_index to respect ignore_index on all axis (#14995) @galipremsagar
  • Raise for pyarrow array that is tz-aware (#14980) @mroeschke
  • Direct SeriesGroupBy.aggregate to SeriesGroupBy.agg (#14971) @rjzamora
  • Respect IntervalDtype and CategoricalDtype objects passed by users (#14961) @mroeschke
  • unset CUDF_SPILL after a pytest (#14958) @galipremsagar
  • Fix Null literals to be not parsed as string when mixed types as string is enabled in JSON reader (#14939) @karthikeyann
  • Fix chunked reads of Parquet delta encoded pages (#14921) @etseidl
  • Fix reading offset for data stream in ORC reader (#14911) @ttnghia
  • Enable sanitizer check for a test case testORCReadAndWriteForDecimal128 (#14897) @res-life
  • Fix dask token normalization (#14829) @rjzamora
  • Fix 24.04 versions (#14825) @raydouglass
  • Ensure slow private attrs are maybe proxies (#14380) @mroeschke

📖 Documentation

  • Ignore DLManagedTensor in the docs build (#15392) @davidwendt
  • Revert "Temporarily disable docs errors. (#15265)" (#15269) @bdice
  • Temporarily disable docs errors. (#15265) @bdice
  • Update developer_guide.md with new guidance on quoted internal includes (#15238) @harrism
  • Fix broken link for developer guide (#15025) @sanjana098
  • [DOC] Update typo in docs example of structs_column_wrapper (#14949) @karthikeyann
  • Update cudf.pandas FAQ. (#14940) @bdice
  • Optimize doc builds (#14856) @vyasr
  • Add developer guideline to use east const. (#14836) @bdice
  • Document how cuDF is pronounced (#14753) @pentschev
  • Notes convert to Pandas-compat (#12641) @Touutae-lab

🚀 New Features

  • Address inconsistency in single quote normalization in JSON reader (#15324) @shrshi
  • Use JNI pinned pool resource with cuIO (#15255) @abellina
  • Add DELTA_BYTE_ARRAY encoder for Parquet (#15239) @etseidl
  • Migrate filling operations to pylibcudf (#15225) @brandon-b-miller
  • [JNI] rmm based pinned pool (#15219) @abellina
  • Implement zero-copy host buffer source instead of using an arrow implementation (#15189) @vuule
  • Enable creation of columns from scalar (#15181) @vyasr
  • Use NVTX from GitHub. (#15178) @bdice
  • Implement segmented_row_bit_count for computing row sizes by segments of rows (#15169) @ttnghia
  • Implement search using pylibcudf (#15166) @vyasr
  • Add distinct left join (#15149) @PointKernel
  • Add cardinality control for groupby benchs with flat types (#15134) @PointKernel
  • Add ability to request Parquet encodings on a per-column basis (#15081) @etseidl
  • Automate include grouping order in .clang-format (#15063) @harrism
  • Requesting a clean build directory also clears Jitify cache (#15052) @robertmaynard
  • API for JSON unquoted whitespace normalization (#15033) @shrshi
  • Implement concatenate, lists.explode, merge, sorting, and stream compaction in pylibcudf (#15011) @vyasr
  • Implement replace in pylibcudf (#15005) @vyasr
  • Add distinct key inner join (#14990) @PointKernel
  • Implement rolling in pylibcudf (#14982) @vyasr
  • Implement joins in pylibcudf (#14972) @vyasr
  • Implement scans and reductions in pylibcudf (#14970) @vyasr
  • Rewrite cudf internals using pylibcudf groupby (#14946) @vyasr
  • Implement groupby in pylibcudf (#14945) @vyasr
  • Support casting of Map type to string in JSON reader (#14936) @karthikeyann
  • POC for whitespace removal in input JSON data using FST (#14931) @shrshi
  • Support for LZ4 compression in ORC and Parquet (#14906) @vuule
  • Remove supports_streams from cuDF custom memory resources. (#14857) @harrism
  • Migrate unary operations to pylibcudf (#14850) @vyasr
  • Migrate binary operations to pylibcudf (#14821) @vyasr
  • Add row index and stripe size options to Python ORC chunked writer (#14785) @vuule
  • Support CUDA 12.2 (#14712) @jameslamb

πŸ› οΈ Improvements

  • Backport: Relax protobuf lower bound to 3.20. (#15506) (#15610) @bdice
  • Use conda env create --yes instead of --force (#15403) @bdice
  • Restructure pylibcudf/arrow interop facilities (#15325) @vyasr
  • Change exceptions thrown by copying APIs (#15319) @vyasr
  • Enable branch testing for cudf.pandas (#15316) @galipremsagar
  • Replace black with ruff-format (#15312) @mroeschke
  • This fixes an NPE when trying to read empty JSON data by adding a new API for missing information (#15307) @revans2
  • Address poor performance of Parquet string decoding (#15304) @etseidl
  • Update script input name (#15301) @AyodeAwe
  • Make test_read_parquet_partitioned_filtered data deterministic (#15296) @mroeschke
  • Add timeout for cudf.pandas pandas tests (#15284) @galipremsagar
  • Add upper bound to prevent usage of NumPy 2 (#15283) @bdice
  • Fix cudf::test::to_host return of host_vector (#15263) @davidwendt
  • Implement grouped product scan (#15254) @wence-
  • Add CUDA 12.4 to supported PTX versions (#15247) @brandon-b-miller
  • Implement DataFrame|Series.squeeze (#15244) @mroeschke
  • Roll back ipow changes due to register pressure. (#15242) @pmattione-nvidia
  • Remove create_chars_child_column utility (#15241) @davidwendt
  • Update dlpack to version 0.8 (#15237) @dantegd
  • Improve performance in JSON reader when mixed_types_as_string option is enabled (#15236) @shrshi
  • Remove row conversion code from libcudf (#15234) @ttnghia
  • Use variable substitution for RAPIDS version in Doxyfile (#15231) @KyleFromNVIDIA
  • Add ListColumns.to_pandas(arrow_type=) (#15228) @mroeschke
  • Treat dask-cudf CI artifacts as pure wheels (#15223) @bdice
  • Clean up usage of __CUDA_ARCH__ and other macros. (#15218) @bdice
  • DOC: use constants in performance-comparisons.ipynb (#15215) @raybellwaves
  • Rewrite conversion in terms of column (#15213) @vyasr
  • Switch pytest-xdist algo to worksteal (#15207) @galipremsagar
  • Deprecate strings_column_view::offsets_begin() (#15205) @davidwendt
  • Add get_upstream_resource method to stream_checking_resource_adaptor (#15203) @miscco
  • Tune up row size estimation in the data generator (#15202) @vuule
  • Fix offset value for generating test data in parquet_chunked_reader_test.cu (#15200) @ttnghia
  • Change strings_column_view::char_size to return int64 (#15197) @davidwendt
  • Fix includes for row_operators.cuh (#15194) @davidwendt
  • Generalize GHA selectors for pure Python testing (#15191) @bdice
  • Improvements for __cuda_array_interface__ tests (#15188) @bdice
  • Allow to_pandas to return pandas.ArrowDtype (#15182) @mroeschke
  • Ignore byte_range in read_json when the size is not smaller than the input data (#15180) @vuule
  • Expose new stable_sort and finish stream_compaction in pylibcudf (#15175) @wence-
  • [ci] update matrix filters for dask-cudf builds (#15174) @jameslamb
  • Change make_strings_children to return uvector (#15171) @davidwendt
  • Don't override to_pandas for Datelike columns (#15167) @mroeschke
  • Drop python-snappy from dependencies. (#15161) @bdice
  • Add microkernels for fixed-width and fixed-width dictionary in Parquet decode (#15159) @abellina
  • Make HostColumnVector.DataType accessor methods public (#15157) @jbrennan333
  • Java bindings for left outer distinct join (#15154) @jlowe
  • Forward-merge branch-24.02 to branch-24.04 (#15153) @bdice
  • Enable pandas pytests for cudf.pandas (#15147) @galipremsagar
  • Add java option to keep quotes for JSON reads (#15146) @revans2
  • Change cross-pandas-version testing in cudf (#15145) @galipremsagar
  • Use hostdevice_vector in kernel_error to avoid the pageable copy (#15140) @vuule
  • Clean up Columns.astype & cudf.dtype (#15125) @mroeschke
  • Simplify some to_pandas implementations (#15123) @mroeschke
  • Java: Add leak tracking for Scalar instances (#15121) @jlowe
  • Remove calls to strings_column_view::offsets_begin() (#15112) @davidwendt
  • Add support for Python 3.11, require NumPy 1.23+ (#15111) @jameslamb
  • Compile-time ipow computation with array lookup (#15110) @pmattione-nvidia
  • Upgrade to arrow-14.0.2 (#15108) @galipremsagar
  • Dynamically set version in RAPIDS doc builds (#15101) @jakirkham
  • Add support for pandas-2.2 in cudf (#15100) @galipremsagar
  • Update devcontainers to CUDA Toolkit 12.2 (#15099) @trxcllnt
  • Fix datetime binop pytest failures in pandas-2.2 (#15090) @galipremsagar
  • Validate types in pylibcudf Column/Table constructors (#15088) @wence-
  • xfail test_join_ordering_pandas_compat for pandas 2.2 (#15080) @mroeschke
  • Add general purpose host memory allocator reference to cuIO with a demo of pooled-pinned allocation. (#15079) @nvdbaranec
  • Adjust test_binops for pandas 2.2 (#15078) @mroeschke
  • Remove offsets_begin() call from nvtext::generate_ngrams (#15077) @davidwendt
  • Use offsetalator in cudf::detail::has_nonempty_null_rows (#15076) @davidwendt
  • Deprecate cudf::hashing::spark_murmurhash3_x86_32 (#15074) @davidwendt
  • Fix cudf::test::to_host to handle both offset types for strings columns (#15073) @davidwendt
  • Add condition for test_groupby_nulls_basic in pandas 2.2 (#15072) @mroeschke
  • xfail tests in test_udf_masked_ops due to pandas 2.2 bug (#15071) @mroeschke
  • target branch-24.04 for GitHub Actions workflows (#15069) @jameslamb
  • Implement stable version of cudf::sort (#15066) @wence-
  • Fix ORC and JSON tests failures for pandas 2.2 (#15062) @mroeschke
  • Adjust test_joining for pandas 2.2 (#15060) @mroeschke
  • Align MultiIndex.get_indexer with pandas 2.2 change (#15059) @mroeschke
  • Fix test_resample index dtype checking for pandas 2.2 (#15058) @mroeschke
  • Split out strings/replace.cu and rework its gtests (#15054) @davidwendt
  • Avoid incompatible value type setting in test_rolling for pandas 2.2 (#15050) @mroeschke
  • Change chained replace inplace test to COW test for pandas 2.2 (#15049) @mroeschke
  • Deprecate datelike isin casting strings to dates to match pandas 2.2 (#15046) @mroeschke
  • Avoid chained indexing in test_indexing for pandas 2.2 (#15045) @mroeschke
  • Avoid pandas 2.2 DeprecationWarning in test_hdf (#15044) @mroeschke
  • Use appropriate make_offsets_child_column for building lists columns (#15043) @davidwendt
  • Factor out position-offsets logic from strings split_helper utility (#15040) @davidwendt
  • Forward-merge branch-24.02 to branch-24.04 (#15039) @bdice
  • Clean up nvtx macros (#15038) @PointKernel
  • Add xfailures for test_applymap for pandas 2.2 (#15034) @mroeschke
  • Expose libcudf filter expression in read_parquet (#15028) @wence-
  • Adjust tests in test_dataframe.py for pandas 2.2 (#15023) @mroeschke
  • Adjust test_datetime_infer_format for pandas 2.2 (#15021) @mroeschke
  • Performance optimizations for parquet sub-rowgroup reader. (#15020) @nvdbaranec
  • JNI bindings for distinct_hash_join (#15019) @jlowe
  • Change copy_if_safe to call thrust instead of the overload function (#15018) @davidwendt
  • Improve performance of copy_if_else for long strings (#15017) @davidwendt
  • Fix is_string_dtype test for pandas 2.2 (#15012) @mroeschke
  • Rework cudf::strings::detail::copy_range for offsetalator (#15010) @davidwendt
  • Use offsetalator in cudf::get_json_object() (#15009) @davidwendt
  • Align integral types in ORC to specs (#15008) @vuule
  • Clean up detail sequence header inclusion (#15007) @PointKernel
  • Add groupby.apply(include_groups=) to match pandas 2.2 deprecation (#15006) @mroeschke
  • Use offsetalator in cudf::interleave_columns() (#15004) @davidwendt
  • Use offsetalator in cudf::row_bit_count() (#15003) @davidwendt
  • Use offsetalator in cudf::strings::wrap() (#15002) @davidwendt
  • Use offsetalator in cudf::strings::reverse (#15001) @davidwendt
  • Deprecate groupby fillna (#15000) @mroeschke
  • Ensure to_* IO methods respect pandas 2.2 keyword only deprecation (#14999) @mroeschke
  • Remove unneeded calls to create_chars_child_column utility (#14997) @davidwendt
  • Add environment-agnostic scripts for running ctests and pytests (#14992) @trxcllnt
  • Filter all DeprecationWarning's by ArrowTable.to_pandas() (#14989) @galipremsagar
  • Deprecate replace with categorical columns (#14988) @mroeschke
  • Deprecate delim_whitespace in read_csv for pandas 2.2 (#14986) @mroeschke
  • Deprecate parameters similar to pandas 2.2 (#14984) @mroeschke
  • Ensure that ctest is called with --no-tests=error. (#14983) @bdice
  • Deprecate non-integer periods in date_range and interval_range (#14976) @galipremsagar
  • Update ops-bot.yaml (#14974) @AyodeAwe
  • Use page statistics in Parquet reader (#14973) @etseidl
  • Use fused types for overloaded function signatures (#14969) @vyasr
  • Deprecate certain frequency strings (#14967) @galipremsagar
  • Update copyrights for 24.04. (#14964) @bdice
  • Add missing atomic operators, refactor atomic operators, move atomic operators to detail namespace. (#14962) @bdice
  • Introduce GetJsonObjectOptions in getJSONObject Java API (#14956) @SurajAralihalli
  • JNI JSON read with DataSource and inferred schema, along with basic Java nested Schema JSON reads (#14954) @revans2
  • Make codecov only informational (always pass). (#14952) @bdice
  • Replace legacy cudf and dask_cudf imports as (d)gd (#14944) @mroeschke
  • Replace is_datetime64tz/interval_dtype with isinstance (#14943) @mroeschke
  • Update tests for pandas 2. (#14941) @bdice
  • Use more public pandas APIs (#14929) @mroeschke
  • Replace local copyright check with pre-commit-hooks verify-copyright (#14917) @KyleFromNVIDIA
  • Add pandas-2.x support in cudf (#14916) @galipremsagar
  • Use offsetalator in nvtext::byte_pair_encoding (#14888) @davidwendt
  • De-DOS line-endings (#14880) @wence-
  • Add detail cuco_allocator (#14877) @PointKernel
  • Move all core types to using enum class in Cython (#14876) @vyasr
  • Read cudf.__version__ in Sphinx build (#14872) @KyleFromNVIDIA
  • Use int64 offset types for accessing code-points in nvtext::normalize (#14868) @davidwendt
  • Read version from VERSION file in CMake (#14867) @KyleFromNVIDIA
  • Update conda-cpp-post-build-checks to branch-24.04. (#14854) @bdice
  • Update cudf for compatibility with the latest cuco (#14849) @PointKernel
  • Remove deprecated strings functions (#14848) @davidwendt
  • Fix CI workflows for pandas-tests and add test summary. (#14847) @bdice
  • Use offsetalator in cudf::strings::copy_slice (#14844) @davidwendt
  • Fix V2 Parquet page alignment for use with zStandard compression (#14841) @etseidl
  • Fix calls to deprecated strings factory API in examples. (#14838) @bdice
  • Update pre-commit hooks (#14837) @bdice
  • Use rapids_cuda_set_runtime to determine cuda runtime usage by target (#14833) @vyasr
  • Remove get_mem_info functions from custom memory resources (#14832) @harrism
  • Fix debug build by splitting row_operator_tests_utilities.cu (#14826) @davidwendt
  • Remove -DNVBench_ENABLE_CUPTI=OFF. (#14820) @bdice
  • Use cuco::static_set in the hash-based groupby (#14813) @PointKernel
  • Branch 24.04 merge branch 24.02 (#14809) @vyasr
  • Branch 24.04 merge branch 24.02 (#14806) @vyasr
  • Introduce basic "cudf" backend for Dask Expressions (#14805) @rjzamora
  • Remove build_struct|list_column (#14786) @mroeschke
  • Use offsetalator in nvtext tokenize functions (#14783) @davidwendt
  • Reduce execution time of Python ORC tests (#14776) @vuule
  • Use offsetalator in cudf::strings::split functions (#14757) @davidwendt
  • Use offsetalator in cudf::strings::findall (#14745) @davidwendt
  • Use offsetalator in cudf::strings::url_decode (#14744) @davidwendt
  • Use get_offset_value utility in strings shift function (#14743) @davidwendt
  • Use as_column instead of full (#14698) @mroeschke
  • List all notable breaking changes (#13535) @galipremsagar

- C++
Published by raydouglass almost 2 years ago

https://github.com/rapidsai/cudf - v24.04.00

🚨 Breaking Changes

  • Restructure pylibcudf/arrow interop facilities (#15325) @vyasr
  • Change exceptions thrown by copying APIs (#15319) @vyasr
  • Change strings_column_view::char_size to return int64 (#15197) @davidwendt
  • Upgrade to arrow-14.0.2 (#15108) @galipremsagar
  • Add support for pandas-2.2 in cudf (#15100) @galipremsagar
  • Deprecate cudf::hashing::spark_murmurhash3_x86_32 (#15074) @davidwendt
  • Align MultiIndex.get_indexer with pandas 2.2 change (#15059) @mroeschke
  • Raise an error on import for unsupported GPUs. (#15053) @bdice
  • Deprecate datelike isin casting strings to dates to match pandas 2.2 (#15046) @mroeschke
  • Align concat Series name behavior in pandas 2.2 (#15032) @mroeschke
  • Add future_stack to DataFrame.stack (#15015) @galipremsagar
  • Deprecate groupby fillna (#15000) @mroeschke
  • Deprecate replace with categorical columns (#14988) @mroeschke
  • Deprecate delim_whitespace in read_csv for pandas 2.2 (#14986) @mroeschke
  • Deprecate parameters similar to pandas 2.2 (#14984) @mroeschke
  • Add missing atomic operators, refactor atomic operators, move atomic operators to detail namespace. (#14962) @bdice
  • Add pandas-2.x support in cudf (#14916) @galipremsagar
  • Use cuco::static_set in the hash-based groupby (#14813) @PointKernel

🐛 Bug Fixes

  • Fix an issue with creating a series from scalar when dtype='category' (#15476) @galipremsagar
  • Update pre-commit-hooks to v0.0.3 (#15355) @KyleFromNVIDIA
  • [BUG][JNI] Trigger MemoryBuffer.onClosed after memory is freed (#15351) @abellina
  • Fix an issue with multiple short list rowgroups using the Parquet chunked reader. (#15342) @nvdbaranec
  • Avoid importing dask-expr if "query-planning" config is False (#15340) @rjzamora
  • Fix gtests/ERROR_TEST errors when run in Debug (#15317) @davidwendt
  • Fix OOB read in inflate_kernel (#15309) @vuule
  • Work around a cuFile error when running CSV tests with memcheck (#15293) @vuule
  • Fix Doxygen upload directory (#15291) @KyleFromNVIDIA
  • Fix Doxygen check (#15289) @KyleFromNVIDIA
  • Reintroduce PANDAS_GE_220 import (#15287) @wence-
  • Fix mean computation for the geometric distribution in the data generator (#15282) @vuule
  • Fix Parquet decimal64 stats (#15281) @etseidl
  • Make linking of nvtx3-cpp BUILD_LOCAL_INTERFACE (#15271) @KyleFromNVIDIA
  • Workaround compute-sanitizer memcheck bug (#15259) @davidwendt
  • Cleanup hostdevice_vector and add more APIs (#15252) @ttnghia
  • Fix number of rows in randomly generated lists columns (#15248) @vuule
  • Fix wrong output for collect_list/collect_set of lists column (#15243) @ttnghia
  • Fix testChunkedPackTwoPasses to copy from the bounce buffer (#15220) @abellina
  • Fix accessing .columns by an external API (#15212) @galipremsagar
  • [JNI] Disable testChunkedPackTwoPasses for now (#15210) @abellina
  • Update labeler and codeowner configs for CMake files (#15208) @PointKernel
  • Avoid dict normalization in __dask_tokenize__ (#15187) @rjzamora
  • Fix memcheck error in distinct inner join (#15164) @PointKernel
  • Remove unneeded script parameters in test_cpp_memcheck.sh (#15158) @davidwendt
  • Fix ListColumn.to_pandas() to retain list type (#15155) @galipremsagar
  • Avoid factorization in MultiIndex.to_pandas (#15150) @mroeschke
  • Fix GroupBy.get_group and GroupBy.indices (#15143) @wence-
  • Remove const from range_window_bounds::_extent. (#15138) @mythrocks
  • DataFrame.columns = ... retains RangeIndex & set dtype (#15129) @mroeschke
  • Correctly handle output for GroupBy.apply when chunk results are reindexed series (#15109) @brandon-b-miller
  • Fix Series.groupby.shift with a MultiIndex (#15098) @mroeschke
  • Fix reductions when DataFrame has MultiIndex columns (#15097) @mroeschke
  • Fix deprecation warnings for deprecated hash() calls (#15095) @davidwendt
  • Add support for arrow large_string in cudf (#15093) @galipremsagar
  • Fix sort_values pytest failure with pandas-2.x regression (#15092) @galipremsagar
  • Resolve path parsing issues in get_json_object (#15082) @SurajAralihalli
  • Fix bugs in handling of delta encodings (#15075) @etseidl
  • Fix is_device_write_preferred in void_sink and user_sink_wrapper (#15064) @vuule
  • Eliminate duplicate allocation of nested string columns (#15061) @vuule
  • Raise an error on import for unsupported GPUs. (#15053) @bdice
  • Align concat Series name behavior in pandas 2.2 (#15032) @mroeschke
  • Fix Index.difference to handle duplicate values when one of the inputs is empty (#15016) @galipremsagar
  • Add future_stack to DataFrame.stack (#15015) @galipremsagar
  • Fix handling of values=None in pylibcudf GroupBy.get_groups (#14998) @shwina
  • Fix DataFrame.sort_index to respect ignore_index on all axis (#14995) @galipremsagar
  • Raise for pyarrow array that is tz-aware (#14980) @mroeschke
  • Direct SeriesGroupBy.aggregate to SeriesGroupBy.agg (#14971) @rjzamora
  • Respect IntervalDtype and CategoricalDtype objects passed by users (#14961) @mroeschke
  • unset CUDF_SPILL after a pytest (#14958) @galipremsagar
  • Fix Null literals to be not parsed as string when mixed types as string is enabled in JSON reader (#14939) @karthikeyann
  • Fix chunked reads of Parquet delta encoded pages (#14921) @etseidl
  • Fix reading offset for data stream in ORC reader (#14911) @ttnghia
  • Enable sanitizer check for a test case testORCReadAndWriteForDecimal128 (#14897) @res-life
  • Fix dask token normalization (#14829) @rjzamora
  • Fix 24.04 versions (#14825) @raydouglass
  • Ensure slow private attrs are maybe proxies (#14380) @mroeschke

📖 Documentation

  • Ignore DLManagedTensor in the docs build (#15392) @davidwendt
  • Revert "Temporarily disable docs errors. (#15265)" (#15269) @bdice
  • Temporarily disable docs errors. (#15265) @bdice
  • Update developer_guide.md with new guidance on quoted internal includes (#15238) @harrism
  • Fix broken link for developer guide (#15025) @sanjana098
  • [DOC] Update typo in docs example of structs_column_wrapper (#14949) @karthikeyann
  • Update cudf.pandas FAQ. (#14940) @bdice
  • Optimize doc builds (#14856) @vyasr
  • Add developer guideline to use east const. (#14836) @bdice
  • Document how cuDF is pronounced (#14753) @pentschev
  • Notes convert to Pandas-compat (#12641) @Touutae-lab

🚀 New Features

  • Address inconsistency in single quote normalization in JSON reader (#15324) @shrshi
  • Use JNI pinned pool resource with cuIO (#15255) @abellina
  • Add DELTA_BYTE_ARRAY encoder for Parquet (#15239) @etseidl
  • Migrate filling operations to pylibcudf (#15225) @brandon-b-miller
  • [JNI] rmm based pinned pool (#15219) @abellina
  • Implement zero-copy host buffer source instead of using an arrow implementation (#15189) @vuule
  • Enable creation of columns from scalar (#15181) @vyasr
  • Use NVTX from GitHub. (#15178) @bdice
  • Implement segmented_row_bit_count for computing row sizes by segments of rows (#15169) @ttnghia
  • Implement search using pylibcudf (#15166) @vyasr
  • Add distinct left join (#15149) @PointKernel
  • Add cardinality control for groupby benchs with flat types (#15134) @PointKernel
  • Add ability to request Parquet encodings on a per-column basis (#15081) @etseidl
  • Automate include grouping order in .clang-format (#15063) @harrism
  • Requesting a clean build directory also clears Jitify cache (#15052) @robertmaynard
  • API for JSON unquoted whitespace normalization (#15033) @shrshi
  • Implement concatenate, lists.explode, merge, sorting, and stream compaction in pylibcudf (#15011) @vyasr
  • Implement replace in pylibcudf (#15005) @vyasr
  • Add distinct key inner join (#14990) @PointKernel
  • Implement rolling in pylibcudf (#14982) @vyasr
  • Implement joins in pylibcudf (#14972) @vyasr
  • Implement scans and reductions in pylibcudf (#14970) @vyasr
  • Rewrite cudf internals using pylibcudf groupby (#14946) @vyasr
  • Implement groupby in pylibcudf (#14945) @vyasr
  • Support casting of Map type to string in JSON reader (#14936) @karthikeyann
  • POC for whitespace removal in input JSON data using FST (#14931) @shrshi
  • Support for LZ4 compression in ORC and Parquet (#14906) @vuule
  • Remove supports_streams from cuDF custom memory resources. (#14857) @harrism
  • Migrate unary operations to pylibcudf (#14850) @vyasr
  • Migrate binary operations to pylibcudf (#14821) @vyasr
  • Add row index and stripe size options to Python ORC chunked writer (#14785) @vuule
  • Support CUDA 12.2 (#14712) @jameslamb

🛠️ Improvements

  • Use conda env create --yes instead of --force (#15403) @bdice
  • Restructure pylibcudf/arrow interop facilities (#15325) @vyasr
  • Change exceptions thrown by copying APIs (#15319) @vyasr
  • Enable branch testing for cudf.pandas (#15316) @galipremsagar
  • Replace black with ruff-format (#15312) @mroeschke
  • This fixes an NPE when trying to read empty JSON data by adding a new API for missing information (#15307) @revans2
  • Address poor performance of Parquet string decoding (#15304) @etseidl
  • Update script input name (#15301) @AyodeAwe
  • Make test_read_parquet_partitioned_filtered data deterministic (#15296) @mroeschke
  • Add timeout for cudf.pandas pandas tests (#15284) @galipremsagar
  • Add upper bound to prevent usage of NumPy 2 (#15283) @bdice
  • Fix cudf::test::to_host return of host_vector (#15263) @davidwendt
  • Implement grouped product scan (#15254) @wence-
  • Add CUDA 12.4 to supported PTX versions (#15247) @brandon-b-miller
  • Implement DataFrame|Series.squeeze (#15244) @mroeschke
  • Roll back ipow changes due to register pressure. (#15242) @pmattione-nvidia
  • Remove create_chars_child_column utility (#15241) @davidwendt
  • Update dlpack to version 0.8 (#15237) @dantegd
  • Improve performance in JSON reader when mixed_types_as_string option is enabled (#15236) @shrshi
  • Remove row conversion code from libcudf (#15234) @ttnghia
  • Use variable substitution for RAPIDS version in Doxyfile (#15231) @KyleFromNVIDIA
  • Add ListColumns.to_pandas(arrow_type=) (#15228) @mroeschke
  • Treat dask-cudf CI artifacts as pure wheels (#15223) @bdice
  • Clean up usage of __CUDA_ARCH__ and other macros. (#15218) @bdice
  • DOC: use constants in performance-comparisons.ipynb (#15215) @raybellwaves
  • Rewrite conversion in terms of column (#15213) @vyasr
  • Switch pytest-xdist algo to worksteal (#15207) @galipremsagar
  • Deprecate strings_column_view::offsets_begin() (#15205) @davidwendt
  • Add get_upstream_resource method to stream_checking_resource_adaptor (#15203) @miscco
  • Tune up row size estimation in the data generator (#15202) @vuule
  • Fix offset value for generating test data in parquet_chunked_reader_test.cu (#15200) @ttnghia
  • Change strings_column_view::char_size to return int64 (#15197) @davidwendt
  • Fix includes for row_operators.cuh (#15194) @davidwendt
  • Generalize GHA selectors for pure Python testing (#15191) @bdice
  • Improvements for __cuda_array_interface__ tests (#15188) @bdice
  • Allow to_pandas to return pandas.ArrowDtype (#15182) @mroeschke
  • Ignore byte_range in read_json when the size is not smaller than the input data (#15180) @vuule
  • Expose new stable_sort and finish stream_compaction in pylibcudf (#15175) @wence-
  • [ci] update matrix filters for dask-cudf builds (#15174) @jameslamb
  • Change make_strings_children to return uvector (#15171) @davidwendt
  • Don't override to_pandas for Datelike columns (#15167) @mroeschke
  • Drop python-snappy from dependencies. (#15161) @bdice
  • Add microkernels for fixed-width and fixed-width dictionary in Parquet decode (#15159) @abellina
  • Make HostColumnVector.DataType accessor methods public (#15157) @jbrennan333
  • Java bindings for left outer distinct join (#15154) @jlowe
  • Forward-merge branch-24.02 to branch-24.04 (#15153) @bdice
  • Enable pandas pytests for cudf.pandas (#15147) @galipremsagar
  • Add java option to keep quotes for JSON reads (#15146) @revans2
  • Change cross-pandas-version testing in cudf (#15145) @galipremsagar
  • Use hostdevice_vector in kernel_error to avoid the pageable copy (#15140) @vuule
  • Clean up Columns.astype & cudf.dtype (#15125) @mroeschke
  • Simplify some to_pandas implementations (#15123) @mroeschke
  • Java: Add leak tracking for Scalar instances (#15121) @jlowe
  • Remove calls to strings_column_view::offsets_begin() (#15112) @davidwendt
  • Add support for Python 3.11, require NumPy 1.23+ (#15111) @jameslamb
  • Compile-time ipow computation with array lookup (#15110) @pmattione-nvidia
  • Upgrade to arrow-14.0.2 (#15108) @galipremsagar
  • Dynamically set version in RAPIDS doc builds (#15101) @jakirkham
  • Add support for pandas-2.2 in cudf (#15100) @galipremsagar
  • Update devcontainers to CUDA Toolkit 12.2 (#15099) @trxcllnt
  • Fix datetime binop pytest failures in pandas-2.2 (#15090) @galipremsagar
  • Validate types in pylibcudf Column/Table constructors (#15088) @wence-
  • xfail test_join_ordering_pandas_compat for pandas 2.2 (#15080) @mroeschke
  • Add general purpose host memory allocator reference to cuIO with a demo of pooled-pinned allocation. (#15079) @nvdbaranec
  • Adjust test_binops for pandas 2.2 (#15078) @mroeschke
  • Remove offsets_begin() call from nvtext::generate_ngrams (#15077) @davidwendt
  • Use offsetalator in cudf::detail::has_nonempty_null_rows (#15076) @davidwendt
  • Deprecate cudf::hashing::spark_murmurhash3_x86_32 (#15074) @davidwendt
  • Fix cudf::test::to_host to handle both offset types for strings columns (#15073) @davidwendt
  • Add condition for test_groupby_nulls_basic in pandas 2.2 (#15072) @mroeschke
  • xfail tests in test_udf_masked_ops due to pandas 2.2 bug (#15071) @mroeschke
  • target branch-24.04 for GitHub Actions workflows (#15069) @jameslamb
  • Implement stable version of cudf::sort (#15066) @wence-
  • Fix ORC and JSON tests failures for pandas 2.2 (#15062) @mroeschke
  • Adjust test_joining for pandas 2.2 (#15060) @mroeschke
  • Align MultiIndex.get_indexer with pandas 2.2 change (#15059) @mroeschke
  • Fix test_resample index dtype checking for pandas 2.2 (#15058) @mroeschke
  • Split out strings/replace.cu and rework its gtests (#15054) @davidwendt
  • Avoid incompatible value type setting in test_rolling for pandas 2.2 (#15050) @mroeschke
  • Change chained replace inplace test to COW test for pandas 2.2 (#15049) @mroeschke
  • Deprecate datelike isin casting strings to dates to match pandas 2.2 (#15046) @mroeschke
  • Avoid chained indexing in test_indexing for pandas 2.2 (#15045) @mroeschke
  • Avoid pandas 2.2 DeprecationWarning in test_hdf (#15044) @mroeschke
  • Use appropriate make_offsets_child_column for building lists columns (#15043) @davidwendt
  • Factor out position-offsets logic from strings split_helper utility (#15040) @davidwendt
  • Forward-merge branch-24.02 to branch-24.04 (#15039) @bdice
  • Clean up nvtx macros (#15038) @PointKernel
  • Add xfailures for test_applymap for pandas 2.2 (#15034) @mroeschke
  • Expose libcudf filter expression in read_parquet (#15028) @wence-
  • Adjust tests in test_dataframe.py for pandas 2.2 (#15023) @mroeschke
  • Adjust test_datetime_infer_format for pandas 2.2 (#15021) @mroeschke
  • Performance optimizations for parquet sub-rowgroup reader. (#15020) @nvdbaranec
  • JNI bindings for distinct_hash_join (#15019) @jlowe
  • Change copy_if_safe to call thrust instead of the overload function (#15018) @davidwendt
  • Improve performance of copy_if_else for long strings (#15017) @davidwendt
  • Fix is_string_dtype test for pandas 2.2 (#15012) @mroeschke
  • Rework cudf::strings::detail::copy_range for offsetalator (#15010) @davidwendt
  • Use offsetalator in cudf::get_json_object() (#15009) @davidwendt
  • Align integral types in ORC to specs (#15008) @vuule
  • Clean up detail sequence header inclusion (#15007) @PointKernel
  • Add groupby.apply(include_groups=) to match pandas 2.2 deprecation (#15006) @mroeschke
  • Use offsetalator in cudf::interleave_columns() (#15004) @davidwendt
  • Use offsetalator in cudf::row_bit_count() (#15003) @davidwendt
  • Use offsetalator in cudf::strings::wrap() (#15002) @davidwendt
  • Use offsetalator in cudf::strings::reverse (#15001) @davidwendt
  • Deprecate groupby fillna (#15000) @mroeschke
  • Ensure to_* IO methods respect pandas 2.2 keyword only deprecation (#14999) @mroeschke
  • Remove unneeded calls to create_chars_child_column utility (#14997) @davidwendt
  • Add environment-agnostic scripts for running ctests and pytests (#14992) @trxcllnt
  • Filter all DeprecationWarning's by ArrowTable.to_pandas() (#14989) @galipremsagar
  • Deprecate replace with categorical columns (#14988) @mroeschke
  • Deprecate delim_whitespace in read_csv for pandas 2.2 (#14986) @mroeschke
  • Deprecate parameters similar to pandas 2.2 (#14984) @mroeschke
  • Ensure that ctest is called with --no-tests=error. (#14983) @bdice
  • Deprecate non-integer periods in date_range and interval_range (#14976) @galipremsagar
  • Update ops-bot.yaml (#14974) @AyodeAwe
  • Use page statistics in Parquet reader (#14973) @etseidl
  • Use fused types for overloaded function signatures (#14969) @vyasr
  • Deprecate certain frequency strings (#14967) @galipremsagar
  • Update copyrights for 24.04. (#14964) @bdice
  • Add missing atomic operators, refactor atomic operators, move atomic operators to detail namespace. (#14962) @bdice
  • Introduce GetJsonObjectOptions in getJSONObject Java API (#14956) @SurajAralihalli
  • JNI JSON read with DataSource and inferred schema, along with basic Java nested Schema JSON reads (#14954) @revans2
  • Make codecov only informational (always pass). (#14952) @bdice
  • Replace legacy cudf and dask_cudf imports as (d)gd (#14944) @mroeschke
  • Replace is_datetime64tz/interval_dtype with isinstance (#14943) @mroeschke
  • Update tests for pandas 2. (#14941) @bdice
  • Use more public pandas APIs (#14929) @mroeschke
  • Replace local copyright check with pre-commit-hooks verify-copyright (#14917) @KyleFromNVIDIA
  • Add pandas-2.x support in cudf (#14916) @galipremsagar
  • Use offsetalator in nvtext::byte_pair_encoding (#14888) @davidwendt
  • De-DOS line-endings (#14880) @wence-
  • Add detail cuco_allocator (#14877) @PointKernel
  • Move all core types to using enum class in Cython (#14876) @vyasr
  • Read cudf.__version__ in Sphinx build (#14872) @KyleFromNVIDIA
  • Use int64 offset types for accessing code-points in nvtext::normalize (#14868) @davidwendt
  • Read version from VERSION file in CMake (#14867) @KyleFromNVIDIA
  • Update conda-cpp-post-build-checks to branch-24.04. (#14854) @bdice
  • Update cudf for compatibility with the latest cuco (#14849) @PointKernel
  • Remove deprecated strings functions (#14848) @davidwendt
  • Fix CI workflows for pandas-tests and add test summary. (#14847) @bdice
  • Use offsetalator in cudf::strings::copy_slice (#14844) @davidwendt
  • Fix V2 Parquet page alignment for use with zStandard compression (#14841) @etseidl
  • Fix calls to deprecated strings factory API in examples. (#14838) @bdice
  • Update pre-commit hooks (#14837) @bdice
  • Use rapids_cuda_set_runtime to determine cuda runtime usage by target (#14833) @vyasr
  • Remove get_mem_info functions from custom memory resources (#14832) @harrism
  • Fix debug build by splitting row_operator_tests_utilities.cu (#14826) @davidwendt
  • Remove -DNVBench_ENABLE_CUPTI=OFF. (#14820) @bdice
  • Use cuco::static_set in the hash-based groupby (#14813) @PointKernel
  • Branch 24.04 merge branch 24.02 (#14809) @vyasr
  • Branch 24.04 merge branch 24.02 (#14806) @vyasr
  • Introduce basic "cudf" backend for Dask Expressions (#14805) @rjzamora
  • Remove build_struct|list_column (#14786) @mroeschke
  • Use offsetalator in nvtext tokenize functions (#14783) @davidwendt
  • Reduce execution time of Python ORC tests (#14776) @vuule
  • Use offsetalator in cudf::strings::split functions (#14757) @davidwendt
  • Use offsetalator in cudf::strings::findall (#14745) @davidwendt
  • Use offsetalator in cudf::strings::url_decode (#14744) @davidwendt
  • Use get_offset_value utility in strings shift function (#14743) @davidwendt
  • Use as_column instead of full (#14698) @mroeschke
  • List all notable breaking changes (#13535) @galipremsagar

- C++
Published by raydouglass almost 2 years ago

https://github.com/rapidsai/cudf - [NIGHTLY] v24.06.00

🔗 Links

🚨 Breaking Changes

  • Remove deprecated strings offsets_begin (#15454) @davidwendt
  • Floating <--> fixed-point conversion must now be called explicitly (#15438) @pmattione-nvidia
  • Remove deprecated hash() and spark_murmurhash3_x86_32() (#15375) @davidwendt
  • Remove empty elements from exploded character-ngrams output (#15371) @davidwendt
  • [FEA] Performance improvement for mixed left semi/anti join (#15288) @tgujar
  • Align date_range defaults with pandas, support tz (#15139) @mroeschke

🐛 Bug Fixes

  • nanoarrow uses package override for proper pinned versions generation (#15515) @robertmaynard
  • Make improvements in pandas-test reporting (#15485) @galipremsagar
  • Fixed page data truncation in parquet writer under certain conditions. (#15474) @nvdbaranec
  • Only use data_type constructor with scale for decimal types (#15472) @wence-
  • Avoid "p2p" shuffle as a default when dask_cudf is imported (#15469) @rjzamora
  • Fix debug build errors from to_arrow_device_test.cpp (#15463) @davidwendt
  • Fix base_normalator::integer_sizeof_fn integer dispatch (#15457) @davidwendt
  • Allow consumers of static builds to find nanoarrow (#15456) @robertmaynard
  • Allow jit compilation when using a splayed CUDA toolkit (#15451) @robertmaynard
  • Test static builds in CI and fix nanoarrow configure (#15437) @vyasr
  • Fixes potential race in JSON parser when parsing JSON lines format and when recovering from invalid lines (#15419) @elstehle
  • Fix errors in chunked ORC writer when no tables were (successfully) written (#15393) @vuule
  • Support implicit array conversion with query-planning enabled (#15378) @rjzamora
  • Fix arrow-based round trip of empty dataframes (#15373) @wence-
  • Remove empty elements from exploded character-ngrams output (#15371) @davidwendt
  • Remove boundscheck=False setting in cython files (#15362) @wence-
  • Patch dask-expr var logic in dask-cudf (#15347) @rjzamora
  • Fix for logical and syntactical errors in libcudf c++ examples (#15346) @mhaseeb123
  • Disable dask-expr in docs builds. (#15343) @bdice
  • Apply the cuFile error work around to data_sink as well (#15335) @vuule

📖 Documentation

  • Add debug tips section to libcudf developer guide (#15329) @davidwendt

🚀 New Features

  • Introduce benchmark suite for JSON reader options (#15124) @shrshi
  • Add to_arrow_device function to cudf interop using nanoarrow (#15047) @zeroshade

🛠️ Improvements

  • Enable tests/scalar and test/series in cudf.pandas tests (#15486) @mroeschke
  • Avoid .ordered and .categories from being settable in CategoricalColumn and CategoricalDtype (#15475) @mroeschke
  • Ignore pandas tests for cudf.pandas that need motoserver (#15468) @mroeschke
  • Use cached_property for NumericColumn.nan_count instead of ._nan_count variable (#15466) @mroeschke
  • Add custom status check workflow (#15464) @galipremsagar
  • Enable tests/strings/test_api.py and tests/io/pytables in cudf.pandas tests (#15461) @mroeschke
  • Remove deprecated strings offsets_begin (#15454) @davidwendt
  • Enable tests/windows/ in cudf.pandas tests (#15444) @mroeschke
  • Enable tests/interchange/test_impl.py in cudf.pandas tests (#15443) @mroeschke
  • Enable tests/io/test_user_agent.py in cudf pandas tests (#15442) @mroeschke
  • Remove prior test skipping in run-pandas-tests with testing 2.2.1 (#15440) @mroeschke
  • Support orc and text IO with dask-expr using legacy conversion (#15439) @rjzamora
  • Floating <--> fixed-point conversion must now be called explicitly (#15438) @pmattione-nvidia
  • Enable dask_cudf json and s3 tests with query-planning on (#15408) @rjzamora
  • Bump ruff and codespell pre-commit checks (#15407) @mroeschke
  • Enable all tests for arm arch (#15402) @galipremsagar
  • Remove deprecated hash() and spark_murmurhash3_x86_32() (#15375) @davidwendt
  • Enable test-reporting for pandas pytests in CI (#15369) @galipremsagar
  • Use logical types in Parquet reader (#15365) @etseidl
  • Forward-merge branch-24.04 to branch-24.06 (#15349) @bdice
  • Use ruff pydocstyle over pydocstyle pre-commit hook (#15345) @mroeschke
  • Refactor stream mode setup for gtests (#15337) @davidwendt
  • Avoid duplicate dask-cudf testing (#15333) @rjzamora
  • Update udf_cpp to use rapids_cpm_cccl. (#15331) @bdice
  • Forward-merge branch-24.04 into branch-24.06 skip ci @rapids-bot[bot]
  • Allow numeric_only=True for simple groupby reductions (#15326) @rjzamora
  • Drop CentOS 7 support. (#15323) @bdice
  • Rework cudf::find_and_replace_all to use gather-based make_strings_column (#15305) @davidwendt
  • First pass at adding testing for pylibcudf (#15300) @vyasr
  • [FEA] Performance improvement for mixed left semi/anti join (#15288) @tgujar
  • Rework cudf::replace_nulls to use strings::detail::copy_if_else (#15286) @davidwendt
  • Large strings support in cudf::concatenate (#15195) @davidwendt
  • Use less _is_categorical_dtype (#15148) @mroeschke
  • Align date_range defaults with pandas, support tz (#15139) @mroeschke
  • ModuleAccelerator performance: cache the result of checking if a caller is in the denylist (#15056) @shwina
  • Use offsetalator in cudf::strings::replace functions (#14824) @davidwendt
  • Cleanup some timedelta/datetime column logic (#14715) @mroeschke
  • Refactor numpy array input in as_column (#14651) @mroeschke

- C++
Published by rapids-bot[bot] almost 2 years ago

https://github.com/rapidsai/cudf - v24.02.02

🚨 Breaking Changes

  • Remove **kwargs from astype (#14765) @mroeschke
  • Remove mimesis as a testing dependency (#14723) @mroeschke
  • Update to Dask's shuffle_method kwarg (#14708) @pentschev
  • Drop Pascal GPU support. (#14630) @bdice
  • Update to CCCL 2.2.0. (#14576) @bdice
  • Expunge as_frame conversions in Column algorithms (#14491) @wence-
  • Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
  • Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
  • Include writer code and writerVersion in ORC files (#14458) @vuule
  • Remove null mask for zero nulls in json readers (#14451) @karthikeyann
  • REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
  • Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
  • Move chars column to parent data buffer in strings column (#14202) @karthikeyann
  • Switch to scikit-build-core (#13531) @vyasr

🐛 Bug Fixes

  • Bump to nvcomp 3.0.6. (#15128) @bdice
  • [HOTFIX] Unpin numba<0.58 (#15031) @raydouglass
  • Exclude tests from builds (#14981) @vyasr
  • Fix the bounce buffer size in ORC writer (#14947) @vuule
  • Revert sum/product aggregation to always produce int64_t type (#14907) @SurajAralihalli
  • Fixed an issue with output chunking computation stemming from input chunking. (#14889) @nvdbaranec
  • Fix total_byte_size in Parquet row group metadata (#14802) @etseidl
  • Fix index difference to follow the pandas format (#14789) @amiralimi
  • Fix shared-workflows repo name (#14784) @raydouglass
  • Remove unparseable attributes from all nodes (#14780) @vyasr
  • Refactor and add validation to IntervalIndex.__init__ (#14778) @mroeschke
  • Work around incompatibilities between V2 page header handling and zStandard compression in Parquet writer (#14772) @etseidl
  • Fix calls to deprecated strings factory API (#14771) @davidwendt
  • Fix ptx file discovery in editable installs (#14767) @vyasr
  • Revise shuffle deprecation to align with dask/dask (#14762) @rjzamora
  • Enable intermediate proxies to be picklable (#14752) @shwina
  • Add CUDF_TEST_PROGRAM_MAIN macro to tests lacking it (#14751) @etseidl
  • Fix CMake args (#14746) @vyasr
  • Fix logic bug introduced in #14730 (#14742) @wence-
  • [Java] Choose The Correct RoundingMode For Checking Decimal OutOfBounds (#14731) @razajafri
  • Fix Groupby.get_group (#14728) @rjzamora
  • Ensure that all CUDA kernels in cudf have hidden visibility. (#14726) @robertmaynard
  • Split cuda versions for notebook testing (#14722) @raydouglass
  • Fix to_numeric not preserving Series index and name (#14718) @mroeschke
  • Update dask-cudf wheel name (#14713) @raydouglass
  • Fix strings::contains matching end of string target (#14711) @davidwendt
  • Update to Dask's shuffle_method kwarg (#14708) @pentschev
  • Write file-level statistics when writing ORC files with zero rows (#14707) @vuule
  • Potential fix for performance regression in #14415 (#14706) @etseidl
  • Ensure DataFrame column types are preserved during serialization (#14705) @mroeschke
  • Skip numba test that fails on ARM (#14702) @brandon-b-miller
  • Allow Z in datetime string parsing in non pandas compat mode (#14701) @mroeschke
  • Fix nan_as_null not being respected when passing arrow object (#14688) @mroeschke
  • Fix constructing Series/Index from arrow array and dtype (#14686) @mroeschke
  • Fix Aggregation Type Promotion: Ensure Unsigned Input Types Result in Unsigned Output for Sum and Multiply (#14679) @SurajAralihalli
  • Add BaseOffset as a final proxy type to pass instancechecks for offsets against BaseOffset (#14678) @shwina
  • Add row conversion code from spark-rapids-jni (#14664) @ttnghia
  • Unconditionally export the CCCL path (#14656) @vyasr
  • Ensure libcudf searches for our patched version of CCCL first (#14655) @robertmaynard
  • Constrain CUDA in notebook testing to prevent CUDA 12.1 usage until we have pynvjitlink (#14648) @vyasr
  • Fix invalid memory access in Parquet reader (#14637) @etseidl
  • Use column_empty over as_column([]) (#14632) @mroeschke
  • Add (implicit) handling for torch tensors in is_scalar (#14623) @wence-
  • Fix astype/fillna not maintaining column subclass and types (#14615) @mroeschke
  • Remove non-empty nulls in cudf::get_json_object (#14609) @davidwendt
  • Remove cuda::proclaim_return_type from nested lambda (#14607) @ttnghia
  • Fix DataFrame.reindex when column reindexing to MultiIndex/RangeIndex (#14605) @mroeschke
  • Address potential race conditions in Parquet reader (#14602) @etseidl
  • Fix DataFrame.reindex removing column name (#14601) @mroeschke
  • Remove unsanitized input test data from copy gtests (#14600) @davidwendt
  • Fix race detected in Parquet writer (#14598) @etseidl
  • Correct invalid or missing return types (#14587) @robertmaynard
  • Fix unsanitized nulls from strings segmented-reduce (#14586) @davidwendt
  • Upgrade to nvCOMP 3.0.5 (#14581) @davidwendt
  • Fix unsanitized nulls produced by cudf::clamp APIs (#14580) @davidwendt
  • Fix unsanitized nulls produced by libcudf dictionary decode (#14578) @davidwendt
  • Fixes a symbol group lookup table issue (#14561) @elstehle
  • Drop llvm16 from cuda118-conda devcontainer image (#14526) @charlesbluca
  • REF: Make DataFrame.from_pandas process by column (#14483) @mroeschke
  • Improve memory footprint of isin by using contains (#14478) @wence-
  • Move creation of env.yaml outside the current directory (#14476) @davidwendt
  • Enable pd.Timestamp objects to be picklable when cudf.pandas is active (#14474) @shwina
  • Correct dtype of count aggregations on empty dataframes (#14473) @wence-
  • Avoid DataFrame conversion in MultiIndex.from_pandas (#14470) @mroeschke
  • JSON writer: avoid default stream use in string_scalar constructors (#14444) @vuule
  • Fix default stream use in the CSV reader (#14443) @vuule
  • Preserve DataFrame(columns=).columns dtype during empty-like construction (#14381) @mroeschke
  • Defer PTX file load to runtime (#13690) @brandon-b-miller

📖 Documentation

  • Disable parallel build (#14796) @vyasr
  • Add pylibcudf to the docs (#14791) @vyasr
  • Describe unpickling expectations when cudf.pandas is enabled (#14693) @shwina
  • Update CONTRIBUTING for pyproject-only builds (#14653) @vyasr
  • More doxygen fixes (#14639) @vyasr
  • Enable doxygen XML generation and fix issues (#14477) @vyasr
  • Some doxygen improvements (#14469) @vyasr
  • Remove warning in dask-cudf docs (#14454) @wence-
  • Update README links with redirects. (#14378) @bdice
  • Add pip install instructions to README (#13677) @shwina

🚀 New Features

  • Add ci check for external kernels (#14768) @robertmaynard
  • JSON single quote normalization API (#14729) @shrshi
  • Write cuDF version in Parquet "created_by" metadata field (#14721) @etseidl
  • Implement remaining copying APIs in pylibcudf along with required helper functions (#14640) @vyasr
  • Don't constrain numba<0.58 (#14616) @brandon-b-miller
  • Add DELTA_LENGTH_BYTE_ARRAY encoder and decoder for Parquet (#14590) @etseidl
  • JSON - Parse mixed types as string in JSON reader (#14572) @karthikeyann
  • JSON quote normalization (#14545) @shrshi
  • Make DefaultHostMemoryAllocator settable (#14523) @gerashegalov
  • Implement more copying APIs in pylibcudf (#14508) @vyasr
  • Include writer code and writerVersion in ORC files (#14458) @vuule
  • Parquet sub-rowgroup reading. (#14360) @nvdbaranec
  • Move chars column to parent data buffer in strings column (#14202) @karthikeyann
  • PARQUET-2261 Size Statistics (#14000) @etseidl
  • Improve GroupBy JIT error handling (#13854) @brandon-b-miller
  • Generate unified Python/C++ docs (#13846) @vyasr
  • Expand JIT groupby test suite (#13813) @brandon-b-miller

🛠️ Improvements

  • Pin pytest<8 (#14920) @galipremsagar
  • Move cudf::char_utf8 definition from detail to public header (#14779) @davidwendt
  • Clean up TimedeltaIndex.__init__ constructor (#14775) @mroeschke
  • Clean up DatetimeIndex.__init__ constructor (#14774) @mroeschke
  • Some frame.py typing, move seldom used methods in frame.py (#14766) @mroeschke
  • Remove **kwargs from astype (#14765) @mroeschke
  • fix benchmarks compatibility with newer pytest-cases (#14764) @jameslamb
  • Add pynvjitlink as a dependency (#14763) @brandon-b-miller
  • Resolve degenerate performance in create_structs_data (#14761) @SurajAralihalli
  • Simplify ColumnAccessor methods; avoid unnecessary validations (#14758) @mroeschke
  • Pin pytest-cases<3.8.2 (#14756) @mroeschke
  • Use _from_data instead of _from_columns for initializing Frame (#14755) @mroeschke
  • Consolidate cudf object handling in as_column (#14754) @mroeschke
  • Reduce execution time of Parquet C++ tests (#14750) @vuule
  • Implement to_datetime(..., utc=True) (#14749) @mroeschke
  • Remove usages of rapids-env-update (#14748) @KyleFromNVIDIA
  • Provide explicit pool size and avoid RMM detail APIs (#14741) @harrism
  • Implement cudf.MultiIndex.from_arrays (#14740) @mroeschke
  • Remove unused/single use methods (#14739) @mroeschke
  • refactor CUDA versions in dependencies.yaml (#14733) @jameslamb
  • Remove unneeded methods in Column (#14730) @mroeschke
  • Clean up base column methods (#14725) @mroeschke
  • Ensure column.fillna signatures are consistent (#14724) @mroeschke
  • Remove mimesis as a testing dependency (#14723) @mroeschke
  • Replace as_numerical with as_numerical_column/codes (#14719) @mroeschke
  • Use offsetalator in gather_chars (#14700) @davidwendt
  • Use make_strings_children for fill() specialization logic (#14697) @davidwendt
  • Change io::detail::orc namespace into io::orc::detail (#14696) @ttnghia
  • Fix call to deprecated factory function (#14695) @davidwendt
  • Use as_column instead of arange for range like inputs (#14689) @mroeschke
  • Reorganize ORC reader into multiple files and perform some small fixes to cuIO code (#14665) @ttnghia
  • Split parquet test into multiple files (#14663) @etseidl
  • Custom error messages for IO with nonexistent files (#14662) @vuule
  • Explicitly pass .dtype into is_foo_dtype functions (#14657) @mroeschke
  • Basic validation in reader benchmarks (#14647) @vuule
  • Update dependencies.yaml to support CUDA 12.*. (#14644) @bdice
  • Consolidate memoryview handling in as_column (#14643) @mroeschke
  • Convert FieldType to scoped enum (#14642) @vuule
  • Use instance over is_foo_dtype (#14641) @mroeschke
  • Use isinstance over is_foo_dtype internally (#14638) @mroeschke
  • Remove unnecessary **kwargs in function signatures (#14635) @mroeschke
  • Drop nvbench patch for nvml. (#14631) @bdice
  • Drop Pascal GPU support. (#14630) @bdice
  • Add cpp/doxygen/xml to .gitignore (#14613) @davidwendt
  • Create strings-specific make_offsets_child_column for multiple offset types (#14612) @davidwendt
  • Use the offsetalator in cudf::concatenate for strings (#14611) @davidwendt
  • Make Parquet ColumnIndex null_counts optional (#14596) @etseidl
  • Support freq in DatetimeIndex (#14593) @shwina
  • Remove legacy benchmarks for cuDF-python (#14591) @osidekyle
  • Remove WORKSPACE env var from cudf_test temp_directory class (#14588) @davidwendt
  • Use exceptions instead of return values to handle errors in CompactProtocolReader (#14582) @vuule
  • Use cuda::proclaim_return_type on device lambdas. (#14577) @bdice
  • Update to CCCL 2.2.0. (#14576) @bdice
  • Update dependencies.yaml to new pip index (#14575) @vyasr
  • Simplify Python CMake (#14565) @vyasr
  • Java expose parquet pass_read_limit (#14564) @revans2
  • Add column sanitization checks in CUDF_TEST_EXPECT_COLUMN_* macros (#14559) @SurajAralihalli
  • Use cudf_test temp_directory class for nvtext::subword_tokenize gbenchmark (#14558) @davidwendt
  • Fix return type of prefix increment overloads (#14544) @vuule
  • Make bpe_merge_pairs_impl member private (#14543) @davidwendt
  • Small clean up in io::statistics (#14542) @vuule
  • Change json gtest environment variable to compile-time definition (#14541) @davidwendt
  • Remove extra total chars size calculation from cudf::concatenate (#14540) @davidwendt
  • Refactor IndexedFrame.hash_values to use cudf::hashing functions, add xxhash64 to cudf Python. (#14538) @bdice
  • Move non-templated inline function definitions from table_view.hpp to table_view.cpp (#14535) @davidwendt
  • Add JNI for strings::code_points (#14533) @thirtiseven
  • Add a test for issue 12773 (#14529) @vyasr
  • Split libarrow build dependencies. (#14506) @bdice
  • Implement IndexedFrame.duplicated with distinct_indices + scatter (#14493) @wence-
  • Expunge as_frame conversions in Column algorithms (#14491) @wence-
  • Remove unsanitized null from input strings column in rank_tests.cpp (#14475) @davidwendt
  • Refactor Parquet kernel_error (#14464) @etseidl
  • Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
  • Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
  • Introduce Comprehensive Pathological Unit Tests for Issue #14409 (#14459) @aocsa
  • Expose stream parameter in public nvtext APIs (#14456) @davidwendt
  • Include encode type in the error message when unsupported Parquet encoding is detected (#14453) @ZelboK
  • Remove null mask for zero nulls in json readers (#14451) @karthikeyann
  • Refactor cudf.Series.__init__ (#14450) @mroeschke
  • Remove the use of volatile in Parquet (#14448) @vuule
  • REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
  • Testing stream pool implementation (#14437) @shrshi
  • Match pandas join ordering obligations in pandas-compatible mode (#14428) @wence-
  • Forward-merge branch-23.12 to branch-24.02 (#14426) @bdice
  • Use isinstance(..., cudf.IntervalDtype) instead of is_interval_dtype (#14424) @mroeschke
  • Use isinstance(..., cudf.CategoricalDtype) instead of is_categorical_dtype (#14423) @mroeschke
  • Forward-merge branch-23.12 to branch-24.02 (#14422) @bdice
  • REF: Remove instances of pd.core (#14421) @mroeschke
  • Expose streams in public filling APIs for label_bins (#14401) @ZelboK
  • Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
  • Limit DELTA_BINARY_PACKED encoder to the same number of bits as the physical type being encoded (#14392) @etseidl
  • Add SHA-1 and SHA-2 hash functions. (#14391) @bdice
  • Expose streams in Parquet reader and writer APIs (#14359) @shrshi
  • Update to fmt 10.1.1 and spdlog 1.12.0. (#14355) @bdice
  • Replace default stream for scalars and column factories usages (because of defaulted arguments) (#14354) @karthikeyann
  • Expose streams in ORC reader and writer APIs (#14350) @shrshi
  • Convert compression and io to string axis type in IO benchmarks (#14347) @SurajAralihalli
  • Add cuDF devcontainers (#14015) @trxcllnt
  • Refactoring of Buffers (last step towards unifying COW and Spilling) (#13801) @madsbk
  • Switch to scikit-build-core (#13531) @vyasr
  • Simplify null count checking in column equality comparator (#13312) @vyasr

- C++
Published by raydouglass almost 2 years ago

https://github.com/rapidsai/cudf - v24.02.01

🚨 Breaking Changes

  • Remove **kwargs from astype (#14765) @mroeschke
  • Remove mimesis as a testing dependency (#14723) @mroeschke
  • Update to Dask's shuffle_method kwarg (#14708) @pentschev
  • Drop Pascal GPU support. (#14630) @bdice
  • Update to CCCL 2.2.0. (#14576) @bdice
  • Expunge as_frame conversions in Column algorithms (#14491) @wence-
  • Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
  • Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
  • Include writer code and writerVersion in ORC files (#14458) @vuule
  • Remove null mask for zero nulls in json readers (#14451) @karthikeyann
  • REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
  • Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
  • Move chars column to parent data buffer in strings column (#14202) @karthikeyann
  • Switch to scikit-build-core (#13531) @vyasr

🐛 Bug Fixes

  • [HOTFIX] Unpin numba<0.58 (#15031) @raydouglass
  • Exclude tests from builds (#14981) @vyasr
  • Fix the bounce buffer size in ORC writer (#14947) @vuule
  • Revert sum/product aggregation to always produce int64_t type (#14907) @SurajAralihalli
  • Fixed an issue with output chunking computation stemming from input chunking. (#14889) @nvdbaranec
  • Fix total_byte_size in Parquet row group metadata (#14802) @etseidl
  • Fix index difference to follow the pandas format (#14789) @amiralimi
  • Fix shared-workflows repo name (#14784) @raydouglass
  • Remove unparseable attributes from all nodes (#14780) @vyasr
  • Refactor and add validation to IntervalIndex.__init__ (#14778) @mroeschke
  • Work around incompatibilities between V2 page header handling and zStandard compression in Parquet writer (#14772) @etseidl
  • Fix calls to deprecated strings factory API (#14771) @davidwendt
  • Fix ptx file discovery in editable installs (#14767) @vyasr
  • Revise shuffle deprecation to align with dask/dask (#14762) @rjzamora
  • Enable intermediate proxies to be picklable (#14752) @shwina
  • Add CUDF_TEST_PROGRAM_MAIN macro to tests lacking it (#14751) @etseidl
  • Fix CMake args (#14746) @vyasr
  • Fix logic bug introduced in #14730 (#14742) @wence-
  • [Java] Choose The Correct RoundingMode For Checking Decimal OutOfBounds (#14731) @razajafri
  • Fix Groupby.get_group (#14728) @rjzamora
  • Ensure that all CUDA kernels in cudf have hidden visibility. (#14726) @robertmaynard
  • Split cuda versions for notebook testing (#14722) @raydouglass
  • Fix to_numeric not preserving Series index and name (#14718) @mroeschke
  • Update dask-cudf wheel name (#14713) @raydouglass
  • Fix strings::contains matching end of string target (#14711) @davidwendt
  • Update to Dask's shuffle_method kwarg (#14708) @pentschev
  • Write file-level statistics when writing ORC files with zero rows (#14707) @vuule
  • Potential fix for performance regression in #14415 (#14706) @etseidl
  • Ensure DataFrame column types are preserved during serialization (#14705) @mroeschke
  • Skip numba test that fails on ARM (#14702) @brandon-b-miller
  • Allow Z in datetime string parsing in non pandas compat mode (#14701) @mroeschke
  • Fix nan_as_null not being respected when passing arrow object (#14688) @mroeschke
  • Fix constructing Series/Index from arrow array and dtype (#14686) @mroeschke
  • Fix Aggregation Type Promotion: Ensure Unsigned Input Types Result in Unsigned Output for Sum and Multiply (#14679) @SurajAralihalli
  • Add BaseOffset as a final proxy type to pass instancechecks for offsets against BaseOffset (#14678) @shwina
  • Add row conversion code from spark-rapids-jni (#14664) @ttnghia
  • Unconditionally export the CCCL path (#14656) @vyasr
  • Ensure libcudf searches for our patched version of CCCL first (#14655) @robertmaynard
  • Constrain CUDA in notebook testing to prevent CUDA 12.1 usage until we have pynvjitlink (#14648) @vyasr
  • Fix invalid memory access in Parquet reader (#14637) @etseidl
  • Use column_empty over as_column([]) (#14632) @mroeschke
  • Add (implicit) handling for torch tensors in is_scalar (#14623) @wence-
  • Fix astype/fillna not maintaining column subclass and types (#14615) @mroeschke
  • Remove non-empty nulls in cudf::get_json_object (#14609) @davidwendt
  • Remove cuda::proclaim_return_type from nested lambda (#14607) @ttnghia
  • Fix DataFrame.reindex when column reindexing to MultiIndex/RangeIndex (#14605) @mroeschke
  • Address potential race conditions in Parquet reader (#14602) @etseidl
  • Fix DataFrame.reindex removing column name (#14601) @mroeschke
  • Remove unsanitized input test data from copy gtests (#14600) @davidwendt
  • Fix race detected in Parquet writer (#14598) @etseidl
  • Correct invalid or missing return types (#14587) @robertmaynard
  • Fix unsanitized nulls from strings segmented-reduce (#14586) @davidwendt
  • Upgrade to nvCOMP 3.0.5 (#14581) @davidwendt
  • Fix unsanitized nulls produced by cudf::clamp APIs (#14580) @davidwendt
  • Fix unsanitized nulls produced by libcudf dictionary decode (#14578) @davidwendt
  • Fixes a symbol group lookup table issue (#14561) @elstehle
  • Drop llvm16 from cuda118-conda devcontainer image (#14526) @charlesbluca
  • REF: Make DataFrame.from_pandas process by column (#14483) @mroeschke
  • Improve memory footprint of isin by using contains (#14478) @wence-
  • Move creation of env.yaml outside the current directory (#14476) @davidwendt
  • Enable pd.Timestamp objects to be picklable when cudf.pandas is active (#14474) @shwina
  • Correct dtype of count aggregations on empty dataframes (#14473) @wence-
  • Avoid DataFrame conversion in MultiIndex.from_pandas (#14470) @mroeschke
  • JSON writer: avoid default stream use in string_scalar constructors (#14444) @vuule
  • Fix default stream use in the CSV reader (#14443) @vuule
  • Preserve DataFrame(columns=).columns dtype during empty-like construction (#14381) @mroeschke
  • Defer PTX file load to runtime (#13690) @brandon-b-miller

📖 Documentation

  • Disable parallel build (#14796) @vyasr
  • Add pylibcudf to the docs (#14791) @vyasr
  • Describe unpickling expectations when cudf.pandas is enabled (#14693) @shwina
  • Update CONTRIBUTING for pyproject-only builds (#14653) @vyasr
  • More doxygen fixes (#14639) @vyasr
  • Enable doxygen XML generation and fix issues (#14477) @vyasr
  • Some doxygen improvements (#14469) @vyasr
  • Remove warning in dask-cudf docs (#14454) @wence-
  • Update README links with redirects. (#14378) @bdice
  • Add pip install instructions to README (#13677) @shwina

🚀 New Features

  • Add ci check for external kernels (#14768) @robertmaynard
  • JSON single quote normalization API (#14729) @shrshi
  • Write cuDF version in Parquet "created_by" metadata field (#14721) @etseidl
  • Implement remaining copying APIs in pylibcudf along with required helper functions (#14640) @vyasr
  • Don't constrain numba<0.58 (#14616) @brandon-b-miller
  • Add DELTA_LENGTH_BYTE_ARRAY encoder and decoder for Parquet (#14590) @etseidl
  • JSON - Parse mixed types as string in JSON reader (#14572) @karthikeyann
  • JSON quote normalization (#14545) @shrshi
  • Make DefaultHostMemoryAllocator settable (#14523) @gerashegalov
  • Implement more copying APIs in pylibcudf (#14508) @vyasr
  • Include writer code and writerVersion in ORC files (#14458) @vuule
  • Parquet sub-rowgroup reading. (#14360) @nvdbaranec
  • Move chars column to parent data buffer in strings column (#14202) @karthikeyann
  • PARQUET-2261 Size Statistics (#14000) @etseidl
  • Improve GroupBy JIT error handling (#13854) @brandon-b-miller
  • Generate unified Python/C++ docs (#13846) @vyasr
  • Expand JIT groupby test suite (#13813) @brandon-b-miller

🛠️ Improvements

  • Pin pytest<8 (#14920) @galipremsagar
  • Move cudf::char_utf8 definition from detail to public header (#14779) @davidwendt
  • Clean up TimedeltaIndex.__init__ constructor (#14775) @mroeschke
  • Clean up DatetimeIndex.__init__ constructor (#14774) @mroeschke
  • Some frame.py typing, move seldom used methods in frame.py (#14766) @mroeschke
  • Remove **kwargs from astype (#14765) @mroeschke
  • fix benchmarks compatibility with newer pytest-cases (#14764) @jameslamb
  • Add pynvjitlink as a dependency (#14763) @brandon-b-miller
  • Resolve degenerate performance in create_structs_data (#14761) @SurajAralihalli
  • Simplify ColumnAccessor methods; avoid unnecessary validations (#14758) @mroeschke
  • Pin pytest-cases<3.8.2 (#14756) @mroeschke
  • Use _from_data instead of _from_columns for initializing Frame (#14755) @mroeschke
  • Consolidate cudf object handling in as_column (#14754) @mroeschke
  • Reduce execution time of Parquet C++ tests (#14750) @vuule
  • Implement to_datetime(..., utc=True) (#14749) @mroeschke
  • Remove usages of rapids-env-update (#14748) @KyleFromNVIDIA
  • Provide explicit pool size and avoid RMM detail APIs (#14741) @harrism
  • Implement cudf.MultiIndex.from_arrays (#14740) @mroeschke
  • Remove unused/single use methods (#14739) @mroeschke
  • refactor CUDA versions in dependencies.yaml (#14733) @jameslamb
  • Remove unneeded methods in Column (#14730) @mroeschke
  • Clean up base column methods (#14725) @mroeschke
  • Ensure column.fillna signatures are consistent (#14724) @mroeschke
  • Remove mimesis as a testing dependency (#14723) @mroeschke
  • Replace as_numerical with as_numerical_column/codes (#14719) @mroeschke
  • Use offsetalator in gather_chars (#14700) @davidwendt
  • Use make_strings_children for fill() specialization logic (#14697) @davidwendt
  • Change io::detail::orc namespace into io::orc::detail (#14696) @ttnghia
  • Fix call to deprecated factory function (#14695) @davidwendt
  • Use as_column instead of arange for range like inputs (#14689) @mroeschke
  • Reorganize ORC reader into multiple files and perform some small fixes to cuIO code (#14665) @ttnghia
  • Split parquet test into multiple files (#14663) @etseidl
  • Custom error messages for IO with nonexistent files (#14662) @vuule
  • Explicitly pass .dtype into is_foo_dtype functions (#14657) @mroeschke
  • Basic validation in reader benchmarks (#14647) @vuule
  • Update dependencies.yaml to support CUDA 12.*. (#14644) @bdice
  • Consolidate memoryview handling in as_column (#14643) @mroeschke
  • Convert FieldType to scoped enum (#14642) @vuule
  • Use instance over is_foo_dtype (#14641) @mroeschke
  • Use isinstance over is_foo_dtype internally (#14638) @mroeschke
  • Remove unnecessary **kwargs in function signatures (#14635) @mroeschke
  • Drop nvbench patch for nvml. (#14631) @bdice
  • Drop Pascal GPU support. (#14630) @bdice
  • Add cpp/doxygen/xml to .gitignore (#14613) @davidwendt
  • Create strings-specific make_offsets_child_column for multiple offset types (#14612) @davidwendt
  • Use the offsetalator in cudf::concatenate for strings (#14611) @davidwendt
  • Make Parquet ColumnIndex null_counts optional (#14596) @etseidl
  • Support freq in DatetimeIndex (#14593) @shwina
  • Remove legacy benchmarks for cuDF-python (#14591) @osidekyle
  • Remove WORKSPACE env var from cudf_test temp_directory class (#14588) @davidwendt
  • Use exceptions instead of return values to handle errors in CompactProtocolReader (#14582) @vuule
  • Use cuda::proclaim_return_type on device lambdas. (#14577) @bdice
  • Update to CCCL 2.2.0. (#14576) @bdice
  • Update dependencies.yaml to new pip index (#14575) @vyasr
  • Simplify Python CMake (#14565) @vyasr
  • Java expose parquet pass_read_limit (#14564) @revans2
  • Add column sanitization checks in CUDF_TEST_EXPECT_COLUMN_* macros (#14559) @SurajAralihalli
  • Use cudf_test temp_directory class for nvtext::subword_tokenize gbenchmark (#14558) @davidwendt
  • Fix return type of prefix increment overloads (#14544) @vuule
  • Make bpe_merge_pairs_impl member private (#14543) @davidwendt
  • Small clean up in io::statistics (#14542) @vuule
  • Change json gtest environment variable to compile-time definition (#14541) @davidwendt
  • Remove extra total chars size calculation from cudf::concatenate (#14540) @davidwendt
  • Refactor IndexedFrame.hash_values to use cudf::hashing functions, add xxhash64 to cudf Python. (#14538) @bdice
  • Move non-templated inline function definitions from table_view.hpp to table_view.cpp (#14535) @davidwendt
  • Add JNI for strings::code_points (#14533) @thirtiseven
  • Add a test for issue 12773 (#14529) @vyasr
  • Split libarrow build dependencies. (#14506) @bdice
  • Implement IndexedFrame.duplicated with distinct_indices + scatter (#14493) @wence-
  • Expunge as_frame conversions in Column algorithms (#14491) @wence-
  • Remove unsanitized null from input strings column in rank_tests.cpp (#14475) @davidwendt
  • Refactor Parquet kernel_error (#14464) @etseidl
  • Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
  • Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
  • Introduce Comprehensive Pathological Unit Tests for Issue #14409 (#14459) @aocsa
  • Expose stream parameter in public nvtext APIs (#14456) @davidwendt
  • Include encode type in the error message when unsupported Parquet encoding is detected (#14453) @ZelboK
  • Remove null mask for zero nulls in json readers (#14451) @karthikeyann
  • Refactor cudf.Series.__init__ (#14450) @mroeschke
  • Remove the use of volatile in Parquet (#14448) @vuule
  • REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
  • Testing stream pool implementation (#14437) @shrshi
  • Match pandas join ordering obligations in pandas-compatible mode (#14428) @wence-
  • Forward-merge branch-23.12 to branch-24.02 (#14426) @bdice
  • Use isinstance(..., cudf.IntervalDtype) instead of is_interval_dtype (#14424) @mroeschke
  • Use isinstance(..., cudf.CategoricalDtype) instead of is_categorical_dtype (#14423) @mroeschke
  • Forward-merge branch-23.12 to branch-24.02 (#14422) @bdice
  • REF: Remove instances of pd.core (#14421) @mroeschke
  • Expose streams in public filling APIs for label_bins (#14401) @ZelboK
  • Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
  • Limit DELTA_BINARY_PACKED encoder to the same number of bits as the physical type being encoded (#14392) @etseidl
  • Add SHA-1 and SHA-2 hash functions. (#14391) @bdice
  • Expose streams in Parquet reader and writer APIs (#14359) @shrshi
  • Update to fmt 10.1.1 and spdlog 1.12.0. (#14355) @bdice
  • Replace default stream for scalars and column factories usages (because of defaulted arguments) (#14354) @karthikeyann
  • Expose streams in ORC reader and writer APIs (#14350) @shrshi
  • Convert compression and io to string axis type in IO benchmarks (#14347) @SurajAralihalli
  • Add cuDF devcontainers (#14015) @trxcllnt
  • Refactoring of Buffers (last step towards unifying COW and Spilling) (#13801) @madsbk
  • Switch to scikit-build-core (#13531) @vyasr
  • Simplify null count checking in column equality comparator (#13312) @vyasr

- C++
Published by raydouglass about 2 years ago

https://github.com/rapidsai/cudf - v24.02.00

🚨 Breaking Changes

  • Remove **kwargs from astype (#14765) @mroeschke
  • Remove mimesis as a testing dependency (#14723) @mroeschke
  • Update to Dask's shuffle_method kwarg (#14708) @pentschev
  • Drop Pascal GPU support. (#14630) @bdice
  • Update to CCCL 2.2.0. (#14576) @bdice
  • Expunge as_frame conversions in Column algorithms (#14491) @wence-
  • Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
  • Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
  • Include writer code and writerVersion in ORC files (#14458) @vuule
  • Remove null mask for zero nulls in json readers (#14451) @karthikeyann
  • REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
  • Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
  • Move chars column to parent data buffer in strings column (#14202) @karthikeyann
  • Switch to scikit-build-core (#13531) @vyasr

🐛 Bug Fixes

  • Exclude tests from builds (#14981) @vyasr
  • Fix the bounce buffer size in ORC writer (#14947) @vuule
  • Revert sum/product aggregation to always produce int64_t type (#14907) @SurajAralihalli
  • Fixed an issue with output chunking computation stemming from input chunking. (#14889) @nvdbaranec
  • Fix total_byte_size in Parquet row group metadata (#14802) @etseidl
  • Fix index difference to follow the pandas format (#14789) @amiralimi
  • Fix shared-workflows repo name (#14784) @raydouglass
  • Remove unparseable attributes from all nodes (#14780) @vyasr
  • Refactor and add validation to IntervalIndex.__init__ (#14778) @mroeschke
  • Work around incompatibilities between V2 page header handling and zStandard compression in Parquet writer (#14772) @etseidl
  • Fix calls to deprecated strings factory API (#14771) @davidwendt
  • Fix ptx file discovery in editable installs (#14767) @vyasr
  • Revise shuffle deprecation to align with dask/dask (#14762) @rjzamora
  • Enable intermediate proxies to be picklable (#14752) @shwina
  • Add CUDF_TEST_PROGRAM_MAIN macro to tests lacking it (#14751) @etseidl
  • Fix CMake args (#14746) @vyasr
  • Fix logic bug introduced in #14730 (#14742) @wence-
  • [Java] Choose The Correct RoundingMode For Checking Decimal OutOfBounds (#14731) @razajafri
  • Fix Groupby.get_group (#14728) @rjzamora
  • Ensure that all CUDA kernels in cudf have hidden visibility. (#14726) @robertmaynard
  • Split cuda versions for notebook testing (#14722) @raydouglass
  • Fix to_numeric not preserving Series index and name (#14718) @mroeschke
  • Update dask-cudf wheel name (#14713) @raydouglass
  • Fix strings::contains matching end of string target (#14711) @davidwendt
  • Update to Dask's shuffle_method kwarg (#14708) @pentschev
  • Write file-level statistics when writing ORC files with zero rows (#14707) @vuule
  • Potential fix for performance regression in #14415 (#14706) @etseidl
  • Ensure DataFrame column types are preserved during serialization (#14705) @mroeschke
  • Skip numba test that fails on ARM (#14702) @brandon-b-miller
  • Allow Z in datetime string parsing in non pandas compat mode (#14701) @mroeschke
  • Fix nan_as_null not being respected when passing arrow object (#14688) @mroeschke
  • Fix constructing Series/Index from arrow array and dtype (#14686) @mroeschke
  • Fix Aggregation Type Promotion: Ensure Unsigned Input Types Result in Unsigned Output for Sum and Multiply (#14679) @SurajAralihalli
  • Add BaseOffset as a final proxy type to pass instancechecks for offsets against BaseOffset (#14678) @shwina
  • Add row conversion code from spark-rapids-jni (#14664) @ttnghia
  • Unconditionally export the CCCL path (#14656) @vyasr
  • Ensure libcudf searches for our patched version of CCCL first (#14655) @robertmaynard
  • Constrain CUDA in notebook testing to prevent CUDA 12.1 usage until we have pynvjitlink (#14648) @vyasr
  • Fix invalid memory access in Parquet reader (#14637) @etseidl
  • Use column_empty over as_column([]) (#14632) @mroeschke
  • Add (implicit) handling for torch tensors in is_scalar (#14623) @wence-
  • Fix astype/fillna not maintaining column subclass and types (#14615) @mroeschke
  • Remove non-empty nulls in cudf::get_json_object (#14609) @davidwendt
  • Remove cuda::proclaim_return_type from nested lambda (#14607) @ttnghia
  • Fix DataFrame.reindex when column reindexing to MultiIndex/RangeIndex (#14605) @mroeschke
  • Address potential race conditions in Parquet reader (#14602) @etseidl
  • Fix DataFrame.reindex removing column name (#14601) @mroeschke
  • Remove unsanitized input test data from copy gtests (#14600) @davidwendt
  • Fix race detected in Parquet writer (#14598) @etseidl
  • Correct invalid or missing return types (#14587) @robertmaynard
  • Fix unsanitized nulls from strings segmented-reduce (#14586) @davidwendt
  • Upgrade to nvCOMP 3.0.5 (#14581) @davidwendt
  • Fix unsanitized nulls produced by cudf::clamp APIs (#14580) @davidwendt
  • Fix unsanitized nulls produced by libcudf dictionary decode (#14578) @davidwendt
  • Fixes a symbol group lookup table issue (#14561) @elstehle
  • Drop llvm16 from cuda118-conda devcontainer image (#14526) @charlesbluca
  • REF: Make DataFrame.from_pandas process by column (#14483) @mroeschke
  • Improve memory footprint of isin by using contains (#14478) @wence-
  • Move creation of env.yaml outside the current directory (#14476) @davidwendt
  • Enable pd.Timestamp objects to be picklable when cudf.pandas is active (#14474) @shwina
  • Correct dtype of count aggregations on empty dataframes (#14473) @wence-
  • Avoid DataFrame conversion in MultiIndex.from_pandas (#14470) @mroeschke
  • JSON writer: avoid default stream use in string_scalar constructors (#14444) @vuule
  • Fix default stream use in the CSV reader (#14443) @vuule
  • Preserve DataFrame(columns=).columns dtype during empty-like construction (#14381) @mroeschke
  • Defer PTX file load to runtime (#13690) @brandon-b-miller

📖 Documentation

  • Disable parallel build (#14796) @vyasr
  • Add pylibcudf to the docs (#14791) @vyasr
  • Describe unpickling expectations when cudf.pandas is enabled (#14693) @shwina
  • Update CONTRIBUTING for pyproject-only builds (#14653) @vyasr
  • More doxygen fixes (#14639) @vyasr
  • Enable doxygen XML generation and fix issues (#14477) @vyasr
  • Some doxygen improvements (#14469) @vyasr
  • Remove warning in dask-cudf docs (#14454) @wence-
  • Update README links with redirects. (#14378) @bdice
  • Add pip install instructions to README (#13677) @shwina

🚀 New Features

  • Add ci check for external kernels (#14768) @robertmaynard
  • JSON single quote normalization API (#14729) @shrshi
  • Write cuDF version in Parquet "created_by" metadata field (#14721) @etseidl
  • Implement remaining copying APIs in pylibcudf along with required helper functions (#14640) @vyasr
  • Don't constrain numba<0.58 (#14616) @brandon-b-miller
  • Add DELTA_LENGTH_BYTE_ARRAY encoder and decoder for Parquet (#14590) @etseidl
  • JSON - Parse mixed types as string in JSON reader (#14572) @karthikeyann
  • JSON quote normalization (#14545) @shrshi
  • Make DefaultHostMemoryAllocator settable (#14523) @gerashegalov
  • Implement more copying APIs in pylibcudf (#14508) @vyasr
  • Include writer code and writerVersion in ORC files (#14458) @vuule
  • Parquet sub-rowgroup reading. (#14360) @nvdbaranec
  • Move chars column to parent data buffer in strings column (#14202) @karthikeyann
  • PARQUET-2261 Size Statistics (#14000) @etseidl
  • Improve GroupBy JIT error handling (#13854) @brandon-b-miller
  • Generate unified Python/C++ docs (#13846) @vyasr
  • Expand JIT groupby test suite (#13813) @brandon-b-miller

🛠️ Improvements

  • Pin pytest<8 (#14920) @galipremsagar
  • Move cudf::char_utf8 definition from detail to public header (#14779) @davidwendt
  • Clean up TimedeltaIndex.__init__ constructor (#14775) @mroeschke
  • Clean up DatetimeIndex.__init__ constructor (#14774) @mroeschke
  • Some frame.py typing, move seldom used methods in frame.py (#14766) @mroeschke
  • Remove **kwargs from astype (#14765) @mroeschke
  • fix benchmarks compatibility with newer pytest-cases (#14764) @jameslamb
  • Add pynvjitlink as a dependency (#14763) @brandon-b-miller
  • Resolve degenerate performance in create_structs_data (#14761) @SurajAralihalli
  • Simplify ColumnAccessor methods; avoid unnecessary validations (#14758) @mroeschke
  • Pin pytest-cases<3.8.2 (#14756) @mroeschke
  • Use _from_data instead of _from_columns for initializing Frame (#14755) @mroeschke
  • Consolidate cudf object handling in as_column (#14754) @mroeschke
  • Reduce execution time of Parquet C++ tests (#14750) @vuule
  • Implement to_datetime(..., utc=True) (#14749) @mroeschke
  • Remove usages of rapids-env-update (#14748) @KyleFromNVIDIA
  • Provide explicit pool size and avoid RMM detail APIs (#14741) @harrism
  • Implement cudf.MultiIndex.from_arrays (#14740) @mroeschke
  • Remove unused/single use methods (#14739) @mroeschke
  • refactor CUDA versions in dependencies.yaml (#14733) @jameslamb
  • Remove unneeded methods in Column (#14730) @mroeschke
  • Clean up base column methods (#14725) @mroeschke
  • Ensure column.fillna signatures are consistent (#14724) @mroeschke
  • Remove mimesis as a testing dependency (#14723) @mroeschke
  • Replace as_numerical with as_numerical_column/codes (#14719) @mroeschke
  • Use offsetalator in gather_chars (#14700) @davidwendt
  • Use make_strings_children for fill() specialization logic (#14697) @davidwendt
  • Change io::detail::orc namespace into io::orc::detail (#14696) @ttnghia
  • Fix call to deprecated factory function (#14695) @davidwendt
  • Use as_column instead of arange for range like inputs (#14689) @mroeschke
  • Reorganize ORC reader into multiple files and perform some small fixes to cuIO code (#14665) @ttnghia
  • Split parquet test into multiple files (#14663) @etseidl
  • Custom error messages for IO with nonexistent files (#14662) @vuule
  • Explicitly pass .dtype into is_foo_dtype functions (#14657) @mroeschke
  • Basic validation in reader benchmarks (#14647) @vuule
  • Update dependencies.yaml to support CUDA 12.*. (#14644) @bdice
  • Consolidate memoryview handling in as_column (#14643) @mroeschke
  • Convert FieldType to scoped enum (#14642) @vuule
  • Use isinstance over is_foo_dtype (#14641) @mroeschke
  • Use isinstance over is_foo_dtype internally (#14638) @mroeschke
  • Remove unnecessary **kwargs in function signatures (#14635) @mroeschke
  • Drop nvbench patch for nvml. (#14631) @bdice
  • Drop Pascal GPU support. (#14630) @bdice
  • Add cpp/doxygen/xml to .gitignore (#14613) @davidwendt
  • Create strings-specific make_offsets_child_column for multiple offset types (#14612) @davidwendt
  • Use the offsetalator in cudf::concatenate for strings (#14611) @davidwendt
  • Make Parquet ColumnIndex null_counts optional (#14596) @etseidl
  • Support freq in DatetimeIndex (#14593) @shwina
  • Remove legacy benchmarks for cuDF-python (#14591) @osidekyle
  • Remove WORKSPACE env var from cudf_test temp_directory class (#14588) @davidwendt
  • Use exceptions instead of return values to handle errors in CompactProtocolReader (#14582) @vuule
  • Use cuda::proclaim_return_type on device lambdas. (#14577) @bdice
  • Update to CCCL 2.2.0. (#14576) @bdice
  • Update dependencies.yaml to new pip index (#14575) @vyasr
  • Simplify Python CMake (#14565) @vyasr
  • Java expose parquet pass_read_limit (#14564) @revans2
  • Add column sanitization checks in CUDF_TEST_EXPECT_COLUMN_* macros (#14559) @SurajAralihalli
  • Use cudf_test temp_directory class for nvtext::subword_tokenize gbenchmark (#14558) @davidwendt
  • Fix return type of prefix increment overloads (#14544) @vuule
  • Make bpe_merge_pairs_impl member private (#14543) @davidwendt
  • Small clean up in io::statistics (#14542) @vuule
  • Change json gtest environment variable to compile-time definition (#14541) @davidwendt
  • Remove extra total chars size calculation from cudf::concatenate (#14540) @davidwendt
  • Refactor IndexedFrame.hash_values to use cudf::hashing functions, add xxhash64 to cudf Python. (#14538) @bdice
  • Move non-templated inline function definitions from table_view.hpp to table_view.cpp (#14535) @davidwendt
  • Add JNI for strings::code_points (#14533) @thirtiseven
  • Add a test for issue 12773 (#14529) @vyasr
  • Split libarrow build dependencies. (#14506) @bdice
  • Implement IndexedFrame.duplicated with distinct_indices + scatter (#14493) @wence-
  • Expunge as_frame conversions in Column algorithms (#14491) @wence-
  • Remove unsanitized null from input strings column in rank_tests.cpp (#14475) @davidwendt
  • Refactor Parquet kernel_error (#14464) @etseidl
  • Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
  • Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
  • Introduce Comprehensive Pathological Unit Tests for Issue #14409 (#14459) @aocsa
  • Expose stream parameter in public nvtext APIs (#14456) @davidwendt
  • Include encode type in the error message when unsupported Parquet encoding is detected (#14453) @ZelboK
  • Remove null mask for zero nulls in json readers (#14451) @karthikeyann
  • Refactor cudf.Series.__init__ (#14450) @mroeschke
  • Remove the use of volatile in Parquet (#14448) @vuule
  • REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
  • Testing stream pool implementation (#14437) @shrshi
  • Match pandas join ordering obligations in pandas-compatible mode (#14428) @wence-
  • Forward-merge branch-23.12 to branch-24.02 (#14426) @bdice
  • Use isinstance(..., cudf.IntervalDtype) instead of is_interval_dtype (#14424) @mroeschke
  • Use isinstance(..., cudf.CategoricalDtype) instead of is_categorical_dtype (#14423) @mroeschke
  • Forward-merge branch-23.12 to branch-24.02 (#14422) @bdice
  • REF: Remove instances of pd.core (#14421) @mroeschke
  • Expose streams in public filling APIs for label_bins (#14401) @ZelboK
  • Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
  • Limit DELTA_BINARY_PACKED encoder to the same number of bits as the physical type being encoded (#14392) @etseidl
  • Add SHA-1 and SHA-2 hash functions. (#14391) @bdice
  • Expose streams in Parquet reader and writer APIs (#14359) @shrshi
  • Update to fmt 10.1.1 and spdlog 1.12.0. (#14355) @bdice
  • Replace default stream for scalars and column factories usages (because of defaulted arguments) (#14354) @karthikeyann
  • Expose streams in ORC reader and writer APIs (#14350) @shrshi
  • Convert compression and io to string axis type in IO benchmarks (#14347) @SurajAralihalli
  • Add cuDF devcontainers (#14015) @trxcllnt
  • Refactoring of Buffers (last step towards unifying COW and Spilling) (#13801) @madsbk
  • Switch to scikit-build-core (#13531) @vyasr
  • Simplify null count checking in column equality comparator (#13312) @vyasr

- C++
Published by raydouglass about 2 years ago

https://github.com/rapidsai/cudf - [NIGHTLY] v24.04.00

🔗 Links

🚨 Breaking Changes

  • Add future_stack to DataFrame.stack (#15015) @galipremsagar
  • Deprecate groupby fillna (#15000) @mroeschke
  • Deprecate replace with categorical columns (#14988) @mroeschke
  • Deprecate parameters similar to pandas 2.2 (#14984) @mroeschke
  • Add pandas-2.x support in cudf (#14916) @galipremsagar

🐛 Bug Fixes

  • Fix Index.difference to handle duplicate values when one of the inputs is empty (#15016) @galipremsagar
  • Add future_stack to DataFrame.stack (#15015) @galipremsagar
  • Fix handling of values=None in pylibcudf GroupBy.get_groups (#14998) @shwina
  • Fix DataFrame.sort_index to respect ignore_index on all axis (#14995) @galipremsagar
  • Raise for pyarrow array that is tz-aware (#14980) @mroeschke
  • Direct SeriesGroupBy.aggregate to SeriesGroupBy.agg (#14971) @rjzamora
  • unset CUDF_SPILL after a pytest (#14958) @galipremsagar
  • Fix dask token normalization (#14829) @rjzamora
  • Fix 24.04 versions (#14825) @raydouglass

📖 Documentation

  • [DOC] Update typo in docs example of structs_column_wrapper (#14949) @karthikeyann
  • Update cudf.pandas FAQ. (#14940) @bdice
  • Optimize doc builds (#14856) @vyasr
  • Add developer guideline to use east const. (#14836) @bdice
  • Notes convert to Pandas-compat (#12641) @Touutae-lab

🚀 New Features

  • Implement replace in pylibcudf (#15005) @vyasr
  • Implement rolling in pylibcudf (#14982) @vyasr
  • Implement joins in pylibcudf (#14972) @vyasr
  • Implement scans and reductions in pylibcudf (#14970) @vyasr
  • Rewrite cudf internals using pylibcudf groupby (#14946) @vyasr
  • Implement groupby in pylibcudf (#14945) @vyasr
  • POC for whitespace removal in input JSON data using FST (#14931) @shrshi
  • Remove supports_streams from cuDF custom memory resources. (#14857) @harrism
  • Migrate unary operations to pylibcudf (#14850) @vyasr
  • Migrate binary operations to pylibcudf (#14821) @vyasr
  • Add row index and stripe size options to Python ORC chunked writer (#14785) @vuule
  • Support CUDA 12.2 (#14712) @jameslamb

🛠️ Improvements

  • Change copy_if_safe to call thrust instead of the overload function (#15018) @davidwendt
  • Fix is_string_dtype test for pandas 2.2 (#15012) @mroeschke
  • Clean up detail sequence header inclusion (#15007) @PointKernel
  • Use offsetalator in cudf::strings::wrap() (#15002) @davidwendt
  • Use offsetalator in cudf::strings::reverse (#15001) @davidwendt
  • Deprecate groupby fillna (#15000) @mroeschke
  • Ensure to_* IO methods respect pandas 2.2 keyword only deprecation (#14999) @mroeschke
  • Filter all DeprecationWarning's by ArrowTable.to_pandas() (#14989) @galipremsagar
  • Deprecate replace with categorical columns (#14988) @mroeschke
  • Deprecate parameters similar to pandas 2.2 (#14984) @mroeschke
  • Ensure that ctest is called with --no-tests=error. (#14983) @bdice
  • Deprecate non-integer periods in date_range and interval_range (#14976) @galipremsagar
  • Use fused types for overloaded function signatures (#14969) @vyasr
  • Deprecate certain frequency strings (#14967) @galipremsagar
  • Update copyrights for 24.04. (#14964) @bdice
  • Introduce GetJsonObjectOptions in getJSONObject Java API (#14956) @SurajAralihalli
  • JNI JSON read with DataSource and inferred schema, along with basic java nested Schema JSON reads (#14954) @revans2
  • Make codecov only informational (always pass). (#14952) @bdice
  • Replace legacy cudf and dask_cudf imports as (d)gd (#14944) @mroeschke
  • Replace isdatetime64tz/interval_dtype with isinstance (#14943) @mroeschke
  • Update tests for pandas 2. (#14941) @bdice
  • Use more public pandas APIs (#14929) @mroeschke
  • Add pandas-2.x support in cudf (#14916) @galipremsagar
  • Use offsetalator in nvtext::byte_pair_encoding (#14888) @davidwendt
  • De-DOS line-endings (#14880) @wence-
  • Add detail cuco_allocator (#14877) @PointKernel
  • Move all core types to using enum class in Cython (#14876) @vyasr
  • Use int64 offset types for accessing code-points in nvtext::normalize (#14868) @davidwendt
  • Update conda-cpp-post-build-checks to branch-24.04. (#14854) @bdice
  • Remove deprecated strings functions (#14848) @davidwendt
  • Fix CI workflows for pandas-tests and add test summary. (#14847) @bdice
  • Use offsetalator in cudf::strings::copy_slice (#14844) @davidwendt
  • Fix V2 Parquet page alignment for use with zStandard compression (#14841) @etseidl
  • Fix calls to deprecated strings factory API in examples. (#14838) @bdice
  • Update pre-commit hooks (#14837) @bdice
  • Use rapids_cuda_set_runtime to determine cuda runtime usage by target (#14833) @vyasr
  • Remove get_mem_info functions from custom memory resources (#14832) @harrism
  • Fix debug build by splitting row_operator_tests_utilities.cu (#14826) @davidwendt
  • Remove -DNVBench_ENABLE_CUPTI=OFF. (#14820) @bdice
  • Branch 24.04 merge branch 24.02 (#14809) @vyasr
  • Branch 24.04 merge branch 24.02 (#14806) @vyasr
  • Use offsetalator in nvtext tokenize functions (#14783) @davidwendt
  • Reduce execution time of Python ORC tests (#14776) @vuule
  • Use offsetalator in cudf::strings::split functions (#14757) @davidwendt
  • Use offsetalator in cudf::strings::findall (#14745) @davidwendt
  • Use offsetalator in cudf::strings::url_decode (#14744) @davidwendt
  • Use get_offset_value utility in strings shift function (#14743) @davidwendt

- C++
Published by rapids-bot[bot] about 2 years ago

https://github.com/rapidsai/cudf - v23.12.01

🚨 Breaking Changes

  • Raise error in reindex when index is not unique (#14400) @galipremsagar
  • Expose stream parameter to get_json_object API (#14297) @davidwendt
  • Refactor cudf_kafka to use skbuild (#14292) @jdye64
  • Expose stream parameter in public strings convert APIs (#14255) @davidwendt
  • Upgrade to nvCOMP 3.0.4 (#13815) @vuule

🐛 Bug Fixes

  • Fix synchronization issue when writing string columns with dictionary to ORC (#14595) @vuule
  • Update actions/labeler to v4 (#14562) @raydouglass
  • Fix data corruption when skipping rows (#14557) @etseidl
  • Fix function name typo in cudf.pandas profiler (#14514) @galipremsagar
  • Fix intermediate type checking in expression parsing (#14445) @vyasr
  • Forward merge branch-23.10 into branch-23.12 (#14435) @raydouglass
  • Remove needs: wheel-build-cudf. (#14427) @bdice
  • Fix dask dependency in custreamz (#14420) @vyasr
  • Ensure nvbench initializes nvml context when built statically (#14411) @robertmaynard
  • Support java AST String literal with desired encoding (#14402) @winningsix
  • Raise error in reindex when index is not unique (#14400) @galipremsagar
  • Always build nvbench statically so we don't need to package it (#14399) @robertmaynard
  • Fix token-count logic in nvtext::tokenize_with_vocabulary (#14393) @davidwendt
  • Fix as_column(pd.Timestamp/Timedelta, length=) not respecting length (#14390) @mroeschke
  • cudf.pandas: cuDF subpath checking in module __getattr__ (#14388) @shwina
  • Fix and disable encoding for nanosecond statistics in ORC writer (#14367) @vuule
  • Add the new manylinux builds to the build job (#14351) @vyasr
  • cudf jit parser now supports .pragma instructions with quotes (#14348) @robertmaynard
  • Fix overflow check in cudf::merge (#14345) @divyegala
  • Add cramjam (#14344) @vyasr
  • Enable dask_cudf/io pytests in CI (#14338) @galipremsagar
  • Temporarily avoid the current build of pydata-sphinx-theme (#14332) @vyasr
  • Fix host buffer access from device function in the Parquet reader (#14328) @vuule
  • Run IO tests for Dask-cuDF (#14327) @rjzamora
  • Fix logical type issues in the Parquet writer (#14322) @vuule
  • Remove aws-sdk-pinning and revert to arrow 12.0.1 (#14319) @vyasr
  • test is_valid before reading column data (#14318) @etseidl
  • Fix gtest validity setting for TextTokenizeTest.Vocabulary (#14312) @davidwendt
  • Fixes stack context for json lines format that recovers from invalid JSON lines (#14309) @elstehle
  • Downgrade to Arrow 12.0.0 for aws-sdk-cpp and fix cudf_kafka builds for new CI containers (#14296) @vyasr
  • fixing thread index overflow issue (#14290) @hyperbolic2346
  • Fix memset error in nvtext::edit_distance_matrix (#14283) @davidwendt
  • Changes JSON reader's recovery option's behaviour to ignore all characters after a valid JSON record (#14279) @elstehle
  • Handle empty string correctly in Parquet statistics (#14257) @etseidl
  • Fixes behaviour for incomplete lines when recover_with_nulls is enabled (#14252) @elstehle
  • cudf::detail::pinned_allocator doesn't throw from deallocate (#14251) @robertmaynard
  • Fix strings replace for adjacent, identical multi-byte UTF-8 character targets (#14235) @davidwendt
  • Fix the precision when converting a decimal128 column to an arrow array (#14230) @jihoonson
  • Fixing parquet list of struct interpretation (#13715) @hyperbolic2346

📖 Documentation

  • Fix io reference in docs. (#14452) @bdice
  • Update README (#14374) @shwina
  • Example code for blog on new row comparators (#13795) @divyegala

🚀 New Features

  • Expose streams in public unary APIs (#14342) @vyasr
  • Add python tests for Parquet DELTA_BINARY_PACKED encoder (#14316) @etseidl
  • Update rapids-cmake functions to non-deprecated signatures (#14265) @robertmaynard
  • Expose streams in public null mask APIs (#14263) @vyasr
  • Expose streams in binaryop APIs (#14187) @vyasr
  • Add pylibcudf.Scalar that interoperates with Arrow scalars (#14133) @vyasr
  • Add decoder for DELTA_BYTE_ARRAY to Parquet reader (#14101) @etseidl
  • Add DELTA_BINARY_PACKED encoder for Parquet writer (#14100) @etseidl
  • Add BytePairEncoder class to cuDF (#13891) @davidwendt
  • Upgrade to nvCOMP 3.0.4 (#13815) @vuule
  • Use pynvjitlink for CUDA 12+ MVC (#13650) @brandon-b-miller

🛠️ Improvements

  • Build concurrency for nightly and merge triggers (#14441) @bdice
  • Cleanup remaining usages of dask dependencies (#14407) @galipremsagar
  • Update to Arrow 14.0.1. (#14387) @bdice
  • Remove Cython libcpp wrappers (#14382) @vyasr
  • Forward-merge branch-23.10 to branch-23.12 (#14372) @bdice
  • Upgrade to arrow 14 (#14371) @galipremsagar
  • Fix a pytest typo in test_kurt_skew_error (#14368) @galipremsagar
  • Use new rapids-dask-dependency metapackage for managing dask versions (#14364) @vyasr
  • Change nullable() to has_nulls() in cudf::detail::gather (#14363) @divyegala
  • Split up scan_inclusive.cu to improve its compile time (#14358) @davidwendt
  • Implement user_datasource_wrapper is_empty() and is_device_read_preferred(). (#14357) @tpn
  • Added streams to CSV reader and writer api (#14340) @shrshi
  • Upgrade wheels to use arrow 13 (#14339) @vyasr
  • Rework nvtext::byte_pair_encoding API (#14337) @davidwendt
  • Improve performance of nvtext::tokenize_with_vocabulary for long strings (#14336) @davidwendt
  • Upgrade arrow to 13 (#14330) @galipremsagar
  • Expose stream parameter in public nvtext replace APIs (#14329) @davidwendt
  • Drop pyorc dependency and use pandas/pyarrow instead (#14323) @galipremsagar
  • Avoid pyarrow.fs import for local storage (#14321) @rjzamora
  • Unpin dask and distributed for 23.12 development (#14320) @galipremsagar
  • Expose stream parameter in public nvtext tokenize APIs (#14317) @davidwendt
  • Added streams to JSON reader and writer api (#14313) @shrshi
  • Minor improvements in source_info (#14308) @vuule
  • Forward-merge branch-23.10 to branch-23.12 (#14307) @bdice
  • Add stream parameter to Set Operations (Public List APIs) (#14305) @SurajAralihalli
  • Expose stream parameter to get_json_object API (#14297) @davidwendt
  • Sort dictionary data alphabetically in the ORC writer (#14295) @vuule
  • Expose stream parameter in public strings filter APIs (#14293) @davidwendt
  • Refactor cudf_kafka to use skbuild (#14292) @jdye64
  • Update shared-action-workflows references (#14289) @AyodeAwe
  • Register partd encode dispatch in dask_cudf (#14287) @rjzamora
  • Update versioning strategy (#14285) @vyasr
  • Move and rename byte-pair-encoding source files (#14284) @davidwendt
  • Expose stream parameter in public strings combine APIs (#14281) @davidwendt
  • Expose stream parameter in public strings contains APIs (#14280) @davidwendt
  • Add stream parameter to List Sort and Filter APIs (#14272) @SurajAralihalli
  • Use branch-23.12 workflows. (#14271) @bdice
  • Refactor LogicalType for Parquet (#14264) @etseidl
  • Centralize chunked reading code in the parquet reader to reader_impl_chunking.cu (#14262) @nvdbaranec
  • Expose stream parameter in public strings replace APIs (#14261) @davidwendt
  • Expose stream parameter in public strings APIs (#14260) @davidwendt
  • Cleanup of namespaces in parquet code. (#14259) @nvdbaranec
  • Make parquet schema index type consistent (#14256) @hyperbolic2346
  • Expose stream parameter in public strings convert APIs (#14255) @davidwendt
  • Add in java bindings for DataSource (#14254) @revans2
  • Reimplement cudf::merge for nested types without using comparators (#14250) @divyegala
  • Add stream parameter to List Manipulation and Operations APIs (#14248) @SurajAralihalli
  • Expose stream parameter in public strings split/partition APIs (#14247) @davidwendt
  • Improve contains_column by invoking contains_table (#14238) @PointKernel
  • Detect and report errors in Parquet header parsing (#14237) @etseidl
  • Normalizing offsets iterator (#14234) @davidwendt
  • Forward merge 23.10 into 23.12 (#14231) @galipremsagar
  • Return error if BOOL8 column-type is used with integers-to-hex (#14208) @davidwendt
  • Enable indexalator for device code (#14206) @davidwendt
  • Marginally reduce memory footprint of joins (#14197) @wence-
  • Add nvtx annotations to spilling-based data movement (#14196) @wence-
  • Optimize ORC writer for decimal columns (#14190) @vuule
  • Remove the use of volatile in ORC (#14175) @vuule
  • Add bytes_per_second to distinct_count of stream_compaction nvbench. (#14172) @Blonck
  • Add bytes_per_second to transpose benchmark (#14170) @Blonck
  • cuDF: Build CUDA 12.0 ARM conda packages. (#14112) @bdice
  • Add bytes_per_second to shift benchmark (#13950) @Blonck
  • Extract debug_utilities.hpp/cu from column_utilities.hpp/cu (#13720) @ttnghia

- C++
Published by raydouglass about 2 years ago

https://github.com/rapidsai/cudf - v23.12.00

🚨 Breaking Changes

  • Raise error in reindex when index is not unique (#14400) @galipremsagar
  • Expose stream parameter to get_json_object API (#14297) @davidwendt
  • Refactor cudf_kafka to use skbuild (#14292) @jdye64
  • Expose stream parameter in public strings convert APIs (#14255) @davidwendt
  • Upgrade to nvCOMP 3.0.4 (#13815) @vuule

🐛 Bug Fixes

  • Update actions/labeler to v4 (#14562) @raydouglass
  • Fix data corruption when skipping rows (#14557) @etseidl
  • Fix function name typo in cudf.pandas profiler (#14514) @galipremsagar
  • Fix intermediate type checking in expression parsing (#14445) @vyasr
  • Forward merge branch-23.10 into branch-23.12 (#14435) @raydouglass
  • Remove needs: wheel-build-cudf. (#14427) @bdice
  • Fix dask dependency in custreamz (#14420) @vyasr
  • Ensure nvbench initializes nvml context when built statically (#14411) @robertmaynard
  • Support java AST String literal with desired encoding (#14402) @winningsix
  • Raise error in reindex when index is not unique (#14400) @galipremsagar
  • Always build nvbench statically so we don't need to package it (#14399) @robertmaynard
  • Fix token-count logic in nvtext::tokenize_with_vocabulary (#14393) @davidwendt
  • Fix as_column(pd.Timestamp/Timedelta, length=) not respecting length (#14390) @mroeschke
  • cudf.pandas: cuDF subpath checking in module __getattr__ (#14388) @shwina
  • Fix and disable encoding for nanosecond statistics in ORC writer (#14367) @vuule
  • Add the new manylinux builds to the build job (#14351) @vyasr
  • cudf jit parser now supports .pragma instructions with quotes (#14348) @robertmaynard
  • Fix overflow check in cudf::merge (#14345) @divyegala
  • Add cramjam (#14344) @vyasr
  • Enable dask_cudf/io pytests in CI (#14338) @galipremsagar
  • Temporarily avoid the current build of pydata-sphinx-theme (#14332) @vyasr
  • Fix host buffer access from device function in the Parquet reader (#14328) @vuule
  • Run IO tests for Dask-cuDF (#14327) @rjzamora
  • Fix logical type issues in the Parquet writer (#14322) @vuule
  • Remove aws-sdk-pinning and revert to arrow 12.0.1 (#14319) @vyasr
  • test is_valid before reading column data (#14318) @etseidl
  • Fix gtest validity setting for TextTokenizeTest.Vocabulary (#14312) @davidwendt
  • Fixes stack context for json lines format that recovers from invalid JSON lines (#14309) @elstehle
  • Downgrade to Arrow 12.0.0 for aws-sdk-cpp and fix cudf_kafka builds for new CI containers (#14296) @vyasr
  • fixing thread index overflow issue (#14290) @hyperbolic2346
  • Fix memset error in nvtext::edit_distance_matrix (#14283) @davidwendt
  • Changes JSON reader's recovery option's behaviour to ignore all characters after a valid JSON record (#14279) @elstehle
  • Handle empty string correctly in Parquet statistics (#14257) @etseidl
  • Fixes behaviour for incomplete lines when recover_with_nulls is enabled (#14252) @elstehle
  • cudf::detail::pinned_allocator doesn't throw from deallocate (#14251) @robertmaynard
  • Fix strings replace for adjacent, identical multi-byte UTF-8 character targets (#14235) @davidwendt
  • Fix the precision when converting a decimal128 column to an arrow array (#14230) @jihoonson
  • Fixing parquet list of struct interpretation (#13715) @hyperbolic2346

📖 Documentation

  • Fix io reference in docs. (#14452) @bdice
  • Update README (#14374) @shwina
  • Example code for blog on new row comparators (#13795) @divyegala

🚀 New Features

  • Expose streams in public unary APIs (#14342) @vyasr
  • Add python tests for Parquet DELTA_BINARY_PACKED encoder (#14316) @etseidl
  • Update rapids-cmake functions to non-deprecated signatures (#14265) @robertmaynard
  • Expose streams in public null mask APIs (#14263) @vyasr
  • Expose streams in binaryop APIs (#14187) @vyasr
  • Add pylibcudf.Scalar that interoperates with Arrow scalars (#14133) @vyasr
  • Add decoder for DELTA_BYTE_ARRAY to Parquet reader (#14101) @etseidl
  • Add DELTA_BINARY_PACKED encoder for Parquet writer (#14100) @etseidl
  • Add BytePairEncoder class to cuDF (#13891) @davidwendt
  • Upgrade to nvCOMP 3.0.4 (#13815) @vuule
  • Use pynvjitlink for CUDA 12+ MVC (#13650) @brandon-b-miller

🛠️ Improvements

  • Build concurrency for nightly and merge triggers (#14441) @bdice
  • Cleanup remaining usages of dask dependencies (#14407) @galipremsagar
  • Update to Arrow 14.0.1. (#14387) @bdice
  • Remove Cython libcpp wrappers (#14382) @vyasr
  • Forward-merge branch-23.10 to branch-23.12 (#14372) @bdice
  • Upgrade to arrow 14 (#14371) @galipremsagar
  • Fix a pytest typo in test_kurt_skew_error (#14368) @galipremsagar
  • Use new rapids-dask-dependency metapackage for managing dask versions (#14364) @vyasr
  • Change nullable() to has_nulls() in cudf::detail::gather (#14363) @divyegala
  • Split up scan_inclusive.cu to improve its compile time (#14358) @davidwendt
  • Implement user_datasource_wrapper is_empty() and is_device_read_preferred(). (#14357) @tpn
  • Added streams to CSV reader and writer api (#14340) @shrshi
  • Upgrade wheels to use arrow 13 (#14339) @vyasr
  • Rework nvtext::byte_pair_encoding API (#14337) @davidwendt
  • Improve performance of nvtext::tokenize_with_vocabulary for long strings (#14336) @davidwendt
  • Upgrade arrow to 13 (#14330) @galipremsagar
  • Expose stream parameter in public nvtext replace APIs (#14329) @davidwendt
  • Drop pyorc dependency and use pandas/pyarrow instead (#14323) @galipremsagar
  • Avoid pyarrow.fs import for local storage (#14321) @rjzamora
  • Unpin dask and distributed for 23.12 development (#14320) @galipremsagar
  • Expose stream parameter in public nvtext tokenize APIs (#14317) @davidwendt
  • Added streams to JSON reader and writer api (#14313) @shrshi
  • Minor improvements in source_info (#14308) @vuule
  • Forward-merge branch-23.10 to branch-23.12 (#14307) @bdice
  • Add stream parameter to Set Operations (Public List APIs) (#14305) @SurajAralihalli
  • Expose stream parameter to get_json_object API (#14297) @davidwendt
  • Sort dictionary data alphabetically in the ORC writer (#14295) @vuule
  • Expose stream parameter in public strings filter APIs (#14293) @davidwendt
  • Refactor cudf_kafka to use skbuild (#14292) @jdye64
  • Update shared-action-workflows references (#14289) @AyodeAwe
  • Register partd encode dispatch in dask_cudf (#14287) @rjzamora
  • Update versioning strategy (#14285) @vyasr
  • Move and rename byte-pair-encoding source files (#14284) @davidwendt
  • Expose stream parameter in public strings combine APIs (#14281) @davidwendt
  • Expose stream parameter in public strings contains APIs (#14280) @davidwendt
  • Add stream parameter to List Sort and Filter APIs (#14272) @SurajAralihalli
  • Use branch-23.12 workflows. (#14271) @bdice
  • Refactor LogicalType for Parquet (#14264) @etseidl
  • Centralize chunked reading code in the parquet reader to reader_impl_chunking.cu (#14262) @nvdbaranec
  • Expose stream parameter in public strings replace APIs (#14261) @davidwendt
  • Expose stream parameter in public strings APIs (#14260) @davidwendt
  • Cleanup of namespaces in parquet code. (#14259) @nvdbaranec
  • Make parquet schema index type consistent (#14256) @hyperbolic2346
  • Expose stream parameter in public strings convert APIs (#14255) @davidwendt
  • Add in java bindings for DataSource (#14254) @revans2
  • Reimplement cudf::merge for nested types without using comparators (#14250) @divyegala
  • Add stream parameter to List Manipulation and Operations APIs (#14248) @SurajAralihalli
  • Expose stream parameter in public strings split/partition APIs (#14247) @davidwendt
  • Improve contains_column by invoking contains_table (#14238) @PointKernel
  • Detect and report errors in Parquet header parsing (#14237) @etseidl
  • Normalizing offsets iterator (#14234) @davidwendt
  • Forward merge 23.10 into 23.12 (#14231) @galipremsagar
  • Return error if BOOL8 column-type is used with integers-to-hex (#14208) @davidwendt
  • Enable indexalator for device code (#14206) @davidwendt
  • Marginally reduce memory footprint of joins (#14197) @wence-
  • Add nvtx annotations to spilling-based data movement (#14196) @wence-
  • Optimize ORC writer for decimal columns (#14190) @vuule
  • Remove the use of volatile in ORC (#14175) @vuule
  • Add bytes_per_second to distinct_count of stream_compaction nvbench. (#14172) @Blonck
  • Add bytes_per_second to transpose benchmark (#14170) @Blonck
  • cuDF: Build CUDA 12.0 ARM conda packages. (#14112) @bdice
  • Add bytes_per_second to shift benchmark (#13950) @Blonck
  • Extract debug_utilities.hpp/cu from column_utilities.hpp/cu (#13720) @ttnghia

- C++
Published by raydouglass about 2 years ago

https://github.com/rapidsai/cudf - [NIGHTLY] v24.02.00

🔗 Links

🚨 Breaking Changes

  • Expunge as_frame conversions in Column algorithms (#14491) @wence-
  • Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
  • REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
  • Consolidate 1D pandas object handling in as_column (#14394) @mroeschke

🐛 Bug Fixes

  • Drop llvm16 from cuda118-conda devcontainer image (#14526) @charlesbluca
  • REF: Make DataFrame.from_pandas process by column (#14483) @mroeschke
  • Improve memory footprint of isin by using contains (#14478) @wence-
  • Enable pd.Timestamp objects to be picklable when cudf.pandas is active (#14474) @shwina
  • Correct dtype of count aggregations on empty dataframes (#14473) @wence-
  • Avoid DataFrame conversion in MultiIndex.from_pandas (#14470) @mroeschke
  • JSON writer: avoid default stream use in string_scalar constructors (#14444) @vuule
  • Fix default stream use in the CSV reader (#14443) @vuule
  • Preserve DataFrame(columns=).columns dtype during empty-like construction (#14381) @mroeschke

📖 Documentation

  • Some doxygen improvements (#14469) @vyasr
  • Remove warning in dask-cudf docs (#14454) @wence-
  • Update README links with redirects. (#14378) @bdice

🚀 New Features

  • Make DefaultHostMemoryAllocator settable (#14523) @gerashegalov

🛠️ Improvements

  • Split libarrow build dependencies. (#14506) @bdice
  • Expunge as_frame conversions in Column algorithms (#14491) @wence-
  • Remove unsanitized null from input strings column in rank_tests.cpp (#14475) @davidwendt
  • Refactor Parquet kernel_error (#14464) @etseidl
  • Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
  • Introduce Comprehensive Pathological Unit Tests for Issue #14409 (#14459) @aocsa
  • Expose stream parameter in public nvtext APIs (#14456) @davidwendt
  • Remove the use of volatile in Parquet (#14448) @vuule
  • REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
  • Match pandas join ordering obligations in pandas-compatible mode (#14428) @wence-
  • Forward-merge branch-23.12 to branch-24.02 (#14426) @bdice
  • Forward-merge branch-23.12 to branch-24.02 (#14422) @bdice
  • REF: Remove instances of pd.core (#14421) @mroeschke
  • Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
  • Update to fmt 10.1.1 and spdlog 1.12.0. (#14355) @bdice
  • Add cuDF devcontainers (#14015) @trxcllnt

- C++
Published by rapids-bot[bot] over 2 years ago

https://github.com/rapidsai/cudf - v23.10.02

🚨 Breaking Changes

  • Raise error in reindex when index is not unique (#14429) @galipremsagar
  • Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
  • Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
  • Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
  • Create table_input_metadata from a table_metadata (#13920) @etseidl
  • Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
  • Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
  • Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
  • Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
  • Remove the libcudf cudf::offset_type type (#13788) @davidwendt
  • Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
  • Update to Cython 3.0.0 (#13777) @vyasr
  • Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
  • Enforce deprecations in 23.10 (#13732) @galipremsagar
  • Upgrade to arrow 12 (#13728) @galipremsagar
  • Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule

🐛 Bug Fixes

  • Raise error in reindex when index is not unique (#14429) @galipremsagar
  • Fix inaccurate ceil/floor and inaccurate rescaling casts of fixed-point values. (#14242) @bdice
  • Fix inaccuracy in decimal128 rounding. (#14233) @bdice
  • Workaround for illegal instruction error in sm90 for warp intrinsics with mask (#14201) @karthikeyann
  • Fix pytorch related pytest (#14198) @galipremsagar
  • Pin to aws-sdk-cpp<1.11 (#14173) @pentschev
  • Fix assert failure for range window functions (#14168) @mythrocks
  • Fix Memcheck error found in JSON_TEST JsonReaderTest.ErrorStrings (#14164) @karthikeyann
  • Fix calls to copy_bitmask to pass stream parameter (#14158) @davidwendt
  • Fix DataFrame from Series with different CategoricalIndexes (#14157) @mroeschke
  • Pin to numpy<1.25 and numba<0.58 to avoid errors and deprecation warnings-as-errors. (#14156) @bdice
  • Fix kernel launch error for cudf::io::orc::gpu::rowgroup_char_counts_kernel (#14139) @davidwendt
  • Don't sort columns for DataFrame init from list of Series (#14136) @mroeschke
  • Fix DataFrame.values with no columns but index (#14134) @mroeschke
  • Avoid circular cimports in _lib/cpp/reduce.pxd (#14125) @vyasr
  • Add support for nested dict in DataFrame constructor (#14119) @galipremsagar
  • Restrict iterables of DataFrame's as input to DataFrame constructor (#14118) @galipremsagar
  • Allow numeric_only=True for reduction operations on numeric types (#14111) @galipremsagar
  • Preserve name of the column while initializing a DataFrame (#14110) @galipremsagar
  • Correct numerous 20054-D: dynamic initialization errors found on arm+12.2 (#14108) @robertmaynard
  • Drop kwargs from Series.count (#14106) @galipremsagar
  • Fix naming issues with Index.to_frame and MultiIndex.to_frame APIs (#14105) @galipremsagar
  • Only use memory resources that haven't been freed (#14103) @robertmaynard
  • Add support for __round__ in Series and DataFrame (#14099) @galipremsagar
  • Validate ignore_index type in drop_duplicates (#14098) @mroeschke
  • Fix renaming Series and Index (#14080) @galipremsagar
  • Raise NotImplementedError in to_datetime if Z (or tz component) in string (#14074) @mroeschke
  • Raise NotImplementedError for datetime strings with UTC offset (#14070) @mroeschke
  • Update pyarrow-related dispatch logic in dask_cudf (#14069) @rjzamora
  • Use conda mambabuild rather than mamba mambabuild (#14067) @wence-
  • Raise NotImplementedError in to_datetime with dayfirst without infer_format (#14058) @mroeschke
  • Fix various issues in Index.intersection (#14054) @galipremsagar
  • Fix Index.difference to match with pandas (#14053) @galipremsagar
  • Fix empty string column construction (#14052) @galipremsagar
  • Fix IntervalIndex.union to preserve type-metadata (#14051) @galipremsagar
  • Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
  • Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
  • Ignore compile_commands.json (#14048) @harrism
  • Raise TypeError for any non-parseable argument in to_datetime (#14044) @mroeschke
  • Raise NotImplementedError for to_datetime with z format (#14037) @mroeschke
  • Implement sort_remaining for sort_index (#14033) @wence-
  • Raise NotImplementedError for Categoricals with timezones (#14032) @mroeschke
  • Temporary fix Parquet metadata with empty value string being ignored from writing (#14026) @ttnghia
  • Preserve types of scalar being returned when possible in quantile (#14014) @galipremsagar
  • Fix return type of MultiIndex.difference (#14009) @galipremsagar
  • Raise an error when timezone subtypes are encountered in pd.IntervalDtype (#14006) @galipremsagar
  • Fix map column can not be non-nullable for java (#14003) @res-life
  • Fix name selection in Index.difference and Index.intersection (#13986) @galipremsagar
  • Restore column type metadata with dropna to fix factorize API (#13980) @galipremsagar
  • Use thread_index_type to avoid out of bounds accesses in conditional joins (#13971) @vyasr
  • Fix MultiIndex.to_numpy to return numpy array with tuples (#13966) @galipremsagar
  • Use cudf::thread_index_type in get_json_object and tdigest kernels (#13962) @nvdbaranec
  • Fix an issue with IntervalIndex.repr when null values are present (#13958) @galipremsagar
  • Fix type metadata issue preservation with Column.unique (#13957) @galipremsagar
  • Handle Interval scalars when passed in list-like inputs to cudf.Index (#13956) @galipremsagar
  • Fix setting of categories order when dtype is passed to a CategoricalColumn (#13955) @galipremsagar
  • Handle as_index in GroupBy.apply (#13951) @brandon-b-miller
  • Raise error for string types in nsmallest and nlargest (#13946) @galipremsagar
  • Fix index of Groupby.apply results when it is performed on empty objects (#13944) @galipremsagar
  • Fix integer overflow in shim device_sum functions (#13943) @brandon-b-miller
  • Fix type mismatch in groupby reduction for empty objects (#13942) @galipremsagar
  • Fixed processed bytes calculation in APPLY_BOOLEAN_MASK benchmark. (#13937) @Blonck
  • Fix construction of Grouping objects (#13932) @galipremsagar
  • Fix an issue with loc when column names is MultiIndex (#13929) @galipremsagar
  • Fix handling of typecasting in searchsorted (#13925) @galipremsagar
  • Preserve index name in reindex (#13917) @galipremsagar
  • Use cudf::thread_index_type in cuIO to prevent overflow in row indexing (#13910) @vuule
  • Fix for encodings listed in the Parquet column chunk metadata (#13907) @etseidl
  • Use cudf::thread_index_type in concatenate.cu. (#13906) @bdice
  • Use cudf::thread_index_type in replace.cu. (#13905) @bdice
  • Add noSanitizer tag to Java reduction tests failing with sanitizer in CUDA 12 (#13904) @jlowe
  • Remove the internal use of the cudf's default stream in cuIO (#13903) @vuule
  • Use cuda-nvtx-dev CUDA 12 package. (#13901) @bdice
  • Use thread_index_type to avoid index overflow in grid-stride loops (#13895) @PointKernel
  • Fix memory access error in cudf::shift for sliced strings (#13894) @davidwendt
  • Raise error when trying to construct a DataFrame with mixed types (#13889) @galipremsagar
  • Return nan when one variable to be correlated has zero variance in JIT GroupBy Apply (#13884) @brandon-b-miller
  • Correctly detect the BOM mark in read_csv with compressed input (#13881) @vuule
  • Check for the presence of all values in MultiIndex.isin (#13879) @galipremsagar
  • Fix nvtext::generate_character_ngrams performance regression for longer strings (#13874) @davidwendt
  • Fix return type of MultiIndex.levels (#13870) @galipremsagar
  • Fix List's missing children metadata in JSON writer (#13869) @karthikeyann
  • Disable construction of Index when freq is set in pandas-compatibility mode (#13857) @galipremsagar
  • Fix an issue with fetching NA from a TimedeltaColumn (#13853) @galipremsagar
  • Simplify implementation of interval_range() and fix behaviour for floating freq (#13844) @shwina
  • Fix binary operations between Series and Index (#13842) @galipremsagar
  • Update make_lists_column_from_scalar to use make_offsets_child_column utility (#13841) @davidwendt
  • Fix read out of bounds in string concatenate (#13838) @pentschev
  • Raise error for more cases when timezone-aware data is passed to as_column (#13835) @galipremsagar
  • Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
  • Raise error when trying to construct time-zone aware timestamps (#13830) @galipremsagar
  • Fix cuFile I/O factories (#13829) @vuule
  • DataFrame with namedtuples uses ._field as column names (#13824) @mroeschke
  • Branch 23.10 merge 23.08 (#13822) @vyasr
  • Return a Series from JIT GroupBy apply, rather than a DataFrame (#13820) @brandon-b-miller
  • No need to dlsym EnsureS3Finalized we can call it directly (#13819) @robertmaynard
  • Raise error when mixed types are being constructed (#13816) @galipremsagar
  • Fix unbounded sequence issue in DataFrame constructor (#13811) @galipremsagar
  • Fix Byte-Pair-Encoding usage of cuco static-map for storing merge-pairs (#13807) @davidwendt
  • Fix for Parquet writer when requested pages per row is smaller than fragment size (#13806) @etseidl
  • Remove hangs from trying to construct un-bounded sequences (#13799) @galipremsagar
  • Bug/update libcudf to handle arrow12 changes (#13794) @robertmaynard
  • Update get_arrow to arrows 12 CMake target name of arrow::xsimd (#13790) @robertmaynard
  • Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
  • Fix negative unary operation for boolean type (#13780) @galipremsagar
  • Fix contains(in) method for Series (#13779) @galipremsagar
  • Fix binary operation column ordering and missing column issues (#13778) @galipremsagar
  • Cast only time of day to nanos to avoid an overflow in Parquet INT96 write (#13776) @gerashegalov
  • Preserve names of column object in various APIs (#13772) @galipremsagar
  • Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
  • Fix construction of DataFrames from dict when columns are provided (#13766) @wence-
  • Provide our own Cython declaration for make_unique (#13746) @wence-

📖 Documentation

  • Fix benchmark image. (#14376) @bdice
  • Fix typo in docstring: metadata. (#14025) @bdice
  • Fix typo in parquet/page_decode.cuh (#13849) @XinyuZeng
  • Simplify Python doc configuration (#13826) @vyasr
  • Update documentation to reflect recent changes in JSON reader and writer (#13791) @vuule
  • Fix all warnings in Python docs (#13789) @vyasr

🚀 New Features

  • [Java] Add JNI bindings for integers_to_hex (#14205) @razajafri
  • Propagate errors from Parquet reader kernels back to host (#14167) @vuule
  • JNI for HISTOGRAM and MERGE_HISTOGRAM aggregations (#14154) @ttnghia
  • Expose streams in all public sorting APIs (#14146) @vyasr
  • Enable direct ingestion and production of Arrow scalars (#14121) @vyasr
  • Implement GroupBy.value_counts to match pandas API (#14114) @stmio
  • Refactor parquet thrift reader (#14097) @etseidl
  • Refactor hash_reduce_by_row (#14095) @ttnghia
  • Support negative preceding/following for ROW window functions (#14093) @mythrocks
  • Support for progressive parquet chunked reading. (#14079) @nvdbaranec
  • Implement HISTOGRAM and MERGE_HISTOGRAM aggregations (#14045) @ttnghia
  • Expose streams in public search APIs (#14034) @vyasr
  • Expose streams in public replace APIs (#14010) @vyasr
  • Add stream parameter to public cudf::strings::split APIs (#13997) @davidwendt
  • Expose streams in public filling APIs (#13990) @vyasr
  • Expose streams in public concatenate APIs (#13987) @vyasr
  • Use HostMemoryAllocator in jni::allocate_host_buffer (#13975) @gerashegalov
  • Enable fractional null probability for hashing benchmark (#13967) @Blonck
  • Switch pylibcudf-enabled types to use enum class in Cython (#13931) @vyasr
  • Add nvtext::tokenize_with_vocabulary API (#13930) @davidwendt
  • Rewrite DataFrame.stack to support multi level column names (#13927) @isVoid
  • Add HostMemoryAllocator interface (#13924) @gerashegalov
  • Global stream pool (#13922) @etseidl
  • Create table_input_metadata from a table_metadata (#13920) @etseidl
  • Translate column size overflow exception to JNI (#13911) @mythrocks
  • Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
  • Exclude some tests from running with the compute sanitizer (#13872) @firestarman
  • Expand statistics support in ORC writer (#13848) @vuule
  • Register the memory mapped buffer in datasource to improve H2D throughput (#13814) @vuule
  • Add cudf::strings::find function with target per row (#13808) @davidwendt
  • Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
  • Remove unnecessary pointer copying in JIT GroupBy Apply (#13792) @brandon-b-miller
  • Add 'poll' function to custreamz kafka consumer (#13782) @jdye64
  • Support corr in GroupBy.apply through the jit engine (#13767) @shwina
  • Optionally write version 2 page headers in Parquet writer (#13751) @etseidl
  • Support more numeric types in Groupby.apply with engine='jit' (#13729) @brandon-b-miller
  • [FEA] Add DELTA_BINARY_PACKED decoding support to Parquet reader (#13637) @etseidl
  • Read FIXED_LEN_BYTE_ARRAY as binary in parquet reader (#13437) @PointKernel

🛠️ Improvements

  • Update shared-action-workflows references (backport from 23.12 to 23.10) (#14300) @AyodeAwe
  • Pin dask and distributed for 23.10 release (#14225) @galipremsagar
  • update rmm tag path (#14195) @AyodeAwe
  • Disable Recently Updated Check (#14193) @ajschmidt8
  • Move cpp/src/hash/hash_allocator.cuh to include/cudf/hashing/detail (#14163) @davidwendt
  • Add Parquet reader benchmarks for row selection (#14147) @vuule
  • Update image names (#14145) @AyodeAwe
  • Support callables in DataFrame.assign (#14142) @wence-
  • Reduce memory usage of as_categorical_column (#14138) @wence-
  • Replace Python scalar conversions with libcudf (#14124) @vyasr
  • Update to clang 16.0.6. (#14120) @bdice
  • Fix type of empty Index and raise warning in Series constructor (#14116) @galipremsagar
  • Add stream parameter to external dict APIs (#14115) @SurajAralihalli
  • Add fallback matrix for nvcomp. (#14082) @bdice
  • [Java] Add recoverWithNull to JSONOptions and pass to Table.readJSON (#14078) @andygrove
  • Remove header tests (#14072) @ajschmidt8
  • Refactor contains_table with cuco::static_set (#14064) @PointKernel
  • Remove debug print in a Parquet test (#14063) @vuule
  • Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
  • Expose stream parameter in public strings find APIs (#14060) @davidwendt
  • Update doxygen to 1.9.1 (#14059) @vyasr
  • Remove the mr from the base fixture (#14057) @vyasr
  • Expose streams in public strings case APIs (#14056) @davidwendt
  • Refactor libcudf indexalator to typed normalator (#14043) @davidwendt
  • Use cudf::make_empty_column instead of column_view constructor (#14030) @davidwendt
  • Remove quadratic runtime due to accessing Frame._dtypes in loop (#14028) @wence-
  • Explicitly depend on zlib in conda recipes (#14018) @wence-
  • Use grid_stride for stride computations. (#13996) @bdice
  • Fix an issue where casting null-array to object dtype will result in a failure (#13994) @galipremsagar
  • Add tab as literal to cudf::test::to_string output (#13993) @davidwendt
  • Enable codes dtype parity in pandas-compatibility mode for factorize API (#13982) @galipremsagar
  • Fix CategoricalIndex ordering in Groupby.agg when pandas-compatibility mode is enabled (#13978) @galipremsagar
  • Produce a fatal error if cudf is unable to find pyarrow include directory (#13976) @cwharris
  • Use thread_index_type in partitioning.cu (#13973) @divyegala
  • Use cudf::thread_index_type in merge.cu (#13972) @divyegala
  • Use copy-pr-bot (#13970) @ajschmidt8
  • Use cudf::thread_index_type in strings custom kernels (#13968) @davidwendt
  • Add bytes_per_second to hash_partition benchmark (#13965) @Blonck
  • Added pinned pool reservation API for java (#13964) @revans2
  • Simplify wheel build scripts and allow alphas of RAPIDS dependencies (#13963) @vyasr
  • Add bytes_per_second to copy_if_else benchmark (#13960) @Blonck
  • Add pandas compatible output to Series.unique (#13959) @galipremsagar
  • Add bytes_per_second to compiled binaryop benchmark (#13938) @Blonck
  • Unpin dask and distributed for 23.10 development (#13935) @galipremsagar
  • Make HostColumnVector.getRefCount public (#13934) @abellina
  • Use cuco::static_set in JSON tree algorithm (#13928) @karthikeyann
  • Add java API to get size of host memory needed to copy column view (#13919) @revans2
  • Use cudf::size_type instead of int32 where appropriate in nvtext functions (#13915) @davidwendt
  • Enable hugepage for arrow host allocations (#13914) @madsbk
  • Improve performance of nvtext::edit_distance (#13912) @davidwendt
  • Ensure cudf internals use pylibcudf in pure Python mode (#13909) @vyasr
  • Use empty() instead of size() where possible (#13908) @vuule
  • [JNI] Adds HostColumnVector.EventHandler for spillability checks (#13898) @abellina
  • Return Timestamp & Timedelta for fetching scalars in DatetimeIndex & TimedeltaIndex (#13896) @galipremsagar
  • Allow explicit shuffle="p2p" within dask-cudf API (#13893) @rjzamora
  • Disable creation of DatetimeIndex when freq is passed to cudf.date_range (#13890) @galipremsagar
  • Bring parity with pandas for datetime & timedelta comparison operations (#13877) @galipremsagar
  • Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
  • Raise error when astype(object) is called in pandas compatibility mode (#13862) @galipremsagar
  • Fixes a performance regression in FST (#13850) @elstehle
  • Set native handles to null on close in Java wrapper classes (#13818) @jlowe
  • Avoid use of CUDF_EXPECTS in libcudf unit tests outside of helper functions with return values (#13812) @vuule
  • Update lists::contains to experimental row comparator (#13810) @divyegala
  • Reduce lists::contains dispatches for scalars (#13805) @divyegala
  • Long string optimization for string column parsing in JSON reader (#13803) @karthikeyann
  • Raise NotImplementedError for pd.SparseDtype (#13798) @mroeschke
  • Remove the libcudf cudf::offset_type type (#13788) @davidwendt
  • Move Spark-independent Table debug to cudf Java (#13783) @gerashegalov
  • Update to Cython 3.0.0 (#13777) @vyasr
  • Refactor Parquet reader handling of V2 page header info (#13775) @etseidl
  • Branch 23.10 merge 23.08 (#13773) @vyasr
  • Restructure JSON code to correctly reflect legacy/experimental status (#13757) @vuule
  • Branch 23.10 merge 23.08 (#13753) @vyasr
  • Enforce deprecations in 23.10 (#13732) @galipremsagar
  • Upgrade to arrow 12 (#13728) @galipremsagar
  • Refactors JSON reader's pushdown automaton (#13716) @elstehle
  • Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule

- C++
Published by raydouglass over 2 years ago

https://github.com/rapidsai/cudf - v23.04.01

🚨 Breaking Changes

  • Pin dask and distributed for release (#13070) @galipremsagar
  • Declare a different name for nan_equality.UNEQUAL to prevent Cython warnings. (#12947) @bdice
  • Update minimum pandas and numpy pinnings (#12887) @galipremsagar
  • Deprecate names & dtype in Index.copy (#12825) @galipremsagar
  • Deprecate Index.is_* methods (#12820) @galipremsagar
  • Deprecate datetime_is_numeric from describe (#12818) @galipremsagar
  • Deprecate na_sentinel in factorize (#12817) @galipremsagar
  • Make string methods return a Series with a useful Index (#12814) @shwina
  • Produce useful guidance on overflow error in to_csv (#12705) @wence-
  • Move strings_udf code into cuDF (#12669) @brandon-b-miller
  • Remove cudf::strings::repeat_strings_output_sizes and optional parameter from cudf::strings::repeat_strings (#12609) @davidwendt
  • Replace message parsing with throwing more specific exceptions (#12426) @vyasr

🐛 Bug Fixes

  • Pin curand version (#13127) @vyasr
  • Fix memcheck script to execute only _TEST files found in bin/gtests/libcudf (#13006) @davidwendt
  • Fix DataFrame constructor to broadcast scalar inputs properly (#12997) @galipremsagar
  • Drop force_nullable_schema from chunked parquet writer (#12996) @galipremsagar
  • Fix gtest column utility comparator diff reporting (#12995) @davidwendt
  • Handle index names while performing groupby (#12992) @galipremsagar
  • Fix __setitem__ on string columns when the scalar value ends in a null byte (#12991) @wence-
  • Fix sort_values when column is all empty strings (#12988) @eriknw
  • Remove unused variable and fix memory issue in ORC writer (#12984) @ttnghia
  • Pre-emptive fix for upstream dask.dataframe.read_parquet changes (#12983) @rjzamora
  • Remove MANIFEST.in use auto-generated one for sdists and package_data for wheels (#12960) @vyasr
  • Update to use rapids-export(COMPONENTS) feature. (#12959) @robertmaynard
  • cudftestutil supports static gtest dependencies (#12957) @robertmaynard
  • Include gtest in build environment. (#12956) @vyasr
  • Correctly handle scalar indices in Index.__getitem__ (#12955) @wence-
  • Avoid building cython twice (#12945) @galipremsagar
  • Fix set index error for Series rolling window operations (#12942) @galipremsagar
  • Fix calculation of null counts for Parquet statistics (#12938) @etseidl
  • Preserve integer dtype of hive-partitioned column containing nulls (#12930) @rjzamora
  • Use get_current_device_resource for intermediate allocations in COLLECT_LIST window code (#12927) @karthikeyann
  • Mark dlpack tensor deleter as noexcept to match PyCapsule_Destructor signature. (#12921) @bdice
  • Fix conda recipe post-link.sh typo (#12916) @pentschev
  • min_rows and num_rows are swapped in ComputePageSizes declaration in Parquet reader (#12886) @etseidl
  • Expect cupy to now support bool arrays for dlpack. (#12883) @vyasr
  • Use python -m pytest for nightly wheel tests (#12871) @bdice
  • Parquet writer column_size() should return a size_t (#12870) @etseidl
  • Fix cudf::hash_partition kernel launch error with decimal128 types (#12863) @davidwendt
  • Fix an issue with parquet chunked reader undercounting string lengths. (#12859) @nvdbaranec
  • Remove tokenizers pre-install pinning. (#12854) @vyasr
  • Fix parquet RangeIndex bug (#12838) @rjzamora
  • Remove KAFKA_HOST_TEST from compute-sanitizer check (#12831) @davidwendt
  • Make string methods return a Series with a useful Index (#12814) @shwina
  • Tell cudf_kafka to use header-only fmt (#12796) @vyasr
  • Add GroupBy.dtypes (#12783) @galipremsagar
  • Fix a leak in a test and clarify some test names (#12781) @revans2
  • Fix bug in all-null list due to join_list_elements special handling (#12767) @karthikeyann
  • Add try/except for expected null-schema error in read_parquet (#12756) @rjzamora
  • Throw an exception if an unsupported page encoding is detected in Parquet reader (#12754) @etseidl
  • Fix a bug with num_keys in _scatter_by_slice (#12749) @thomcom
  • Bump pinned rapids wheel deps to 23.4 (#12735) @sevagh
  • Rework logic in cudf::strings::split_record to improve performance (#12729) @davidwendt
  • Add always_nullable flag to Dremel encoding (#12727) @divyegala
  • Fix memcheck read error in compound segmented reduce (#12722) @davidwendt
  • Fix faulty conditional logic in JIT GroupBy.apply (#12706) @brandon-b-miller
  • Produce useful guidance on overflow error in to_csv (#12705) @wence-
  • Handle parquet list data corner case (#12698) @nvdbaranec
  • Fix missing trailing comma in json writer (#12688) @karthikeyann
  • Remove child from newCudaAsyncMemoryResource (#12681) @abellina
  • Handle bool types in round API (#12670) @galipremsagar
  • Ensure all of device bitmask is initialized in from_arrow (#12668) @wence-
  • Fix from_arrow to load a sliced arrow table (#12665) @galipremsagar
  • Fix dask-cudf read_parquet bug for multi-file aggregation (#12663) @rjzamora
  • Fix AllocateLikeTest gtests reading uninitialized null-mask (#12643) @davidwendt
  • Fix find_common_dtype and values to handle complex dtypes (#12537) @galipremsagar
  • Fix fetching of MultiIndex values when a label is passed (#12521) @galipremsagar
  • Fix Series comparison vs scalars (#12519) @brandon-b-miller
  • Allow casting from UDFString back to StringView to call methods in strings_udf (#12363) @brandon-b-miller

📖 Documentation

  • Fix GroupBy.apply doc examples rendering (#12994) @brandon-b-miller
  • add sphinx building and s3 uploading for dask-cudf docs (#12982) @quasiben
  • Add developer documentation forbidding default parameters in detail APIs (#12978) @vyasr
  • Add README symlink for dask-cudf. (#12946) @bdice
  • Remove return type from @return doxygen tags (#12908) @davidwendt
  • Fix docs build to be pydata-sphinx-theme=0.13.0 compatible (#12874) @galipremsagar
  • Add skeleton API and prose documentation for dask-cudf (#12725) @wence-
  • Enable doctests for GroupBy methods (#12658) @brandon-b-miller
  • Add comment about CUB patch for SegmentedSortInt.Bool gtest (#12611) @davidwendt

🚀 New Features

  • Add JNI method for strings::replace multi variety (#12979) @NVnavkumar
  • Add nunique aggregation support for cudf::segmented_reduce (#12972) @davidwendt
  • Refactor orc chunked writer (#12949) @ttnghia
  • Make Parquet writer nullable option applicable to single table writes (#12933) @vuule
  • Refactor io::orc::ProtobufWriter (#12877) @ttnghia
  • Make timezone table independent from ORC (#12805) @vuule
  • Cache JIT GroupBy.apply functions (#12802) @brandon-b-miller
  • Implement initial support for avro logical types (#6482) (#12788) @tpn
  • Update tests/column_utilities to use experimental::equality row comparator (#12777) @divyegala
  • Update distinct/unique_count to experimental::row hasher/comparator (#12776) @divyegala
  • Update hash_partition to use experimental::row::row_hasher (#12761) @divyegala
  • Update is_sorted to use experimental::row::lexicographic (#12752) @divyegala
  • Update default data source in cuio reader benchmarks (#12740) @PointKernel
  • Reenable stream identification library in CI (#12714) @vyasr
  • Add regex_program strings splitting java APIs and tests (#12713) @cindyyuanjiang
  • Add regex_program strings replacing java APIs and tests (#12701) @cindyyuanjiang
  • Add regex_program strings extract java APIs and tests (#12699) @cindyyuanjiang
  • Variable fragment sizes for Parquet writer (#12685) @etseidl
  • Add segmented reduction support for fixed-point types (#12680) @davidwendt
  • Move strings_udf code into cuDF (#12669) @brandon-b-miller
  • Add regex_program searching APIs and related java classes (#12666) @cindyyuanjiang
  • Add logging to libcudf (#12637) @vuule
  • Add compound aggregations to cudf::segmented_reduce (#12573) @davidwendt
  • Convert rank to use experimental row comparators (#12481) @divyegala
  • Use rapids-cmake parallel testing feature (#12451) @robertmaynard
  • Enable detection of undesired stream usage (#12089) @vyasr

🛠️ Improvements

  • Pin dask and distributed for release (#13070) @galipremsagar
  • Pin cupy in wheel tests to supported versions (#13041) @vyasr
  • Pin numba version (#13001) @vyasr
  • Rework gtests SequenceTest to remove using namespace cudf (#12985) @davidwendt
  • Stop setting package version attribute in wheels (#12977) @vyasr
  • Move detail reduction functions to cudf::reduction::detail namespace (#12971) @davidwendt
  • Remove default detail mrs: part7 (#12970) @vyasr
  • Remove default detail mrs: part6 (#12969) @vyasr
  • Remove default detail mrs: part5 (#12968) @vyasr
  • Remove default detail mrs: part4 (#12967) @vyasr
  • Remove default detail mrs: part3 (#12966) @vyasr
  • Remove default detail mrs: part2 (#12965) @vyasr
  • Remove default detail mrs: part1 (#12964) @vyasr
  • Add force_nullable_schema parameter to Parquet writer. (#12952) @galipremsagar
  • Declare a different name for nan_equality.UNEQUAL to prevent Cython warnings. (#12947) @bdice
  • Remove remaining default stream parameters (#12943) @vyasr
  • Fix cudf::segmented_reduce gtest for ANY aggregation (#12940) @davidwendt
  • Implement groupby.head and groupby.tail (#12939) @wence-
  • Fix libcudf gtests to pass null-count=0 for empty validity masks (#12923) @davidwendt
  • Migrate parquet encoding to use experimental row operators (#12918) @PointKernel
  • Fix benchmarks coded in namespace cudf and using namespace cudf (#12915) @karthikeyann
  • Fix io/text gtests coded in namespace cudf::test (#12914) @karthikeyann
  • Pass SCCACHE_S3_USE_SSL to conda builds (#12910) @ajschmidt8
  • Fix FST, JSON gtests & benchmarks coded in namespace cudf::test (#12907) @karthikeyann
  • Generate pyproject dependencies using dfg (#12906) @vyasr
  • Update libcudf counting functions to specify cudf::size_type (#12904) @davidwendt
  • Fix moto env vars & pass AWS_SESSION_TOKEN to conda builds (#12902) @ajschmidt8
  • Rewrite CSV writer benchmark with nvbench (#12901) @PointKernel
  • Rework some code logic to reduce iterator and comparator inlining to improve compile time (#12900) @davidwendt
  • Deprecate line_terminator in favor of lineterminator in to_csv (#12896) @wence-
  • Add stream and mr parameters for structs::detail::flatten_nested_columns (#12892) @ttnghia
  • Deprecate libcudf regex APIs accepting pattern strings directly (#12891) @davidwendt
  • Remove default parameters from detail headers in include (#12888) @vyasr
  • Update minimum pandas and numpy pinnings (#12887) @galipremsagar
  • Implement groupby.sample (#12882) @wence-
  • Update JNI build ENV default to gcc 11 (#12881) @pxLi
  • Change return type of cudf::structs::detail::flatten_nested_columns to smart pointer (#12878) @ttnghia
  • Fix passing seed parameter to MurmurHash3_32 in cudf::hash() function (#12875) @davidwendt
  • Remove manual artifact upload step in CI (#12869) @ajschmidt8
  • Update to GCC 11 (#12868) @bdice
  • Fix null hive-partition behavior in dask-cudf parquet (#12866) @rjzamora
  • Update to protobuf>=4.21.6,<4.22. (#12864) @bdice
  • Update RMM allocators (#12861) @pentschev
  • Improve performance for replace-multi for long strings (#12858) @davidwendt
  • Drop Python 3.7 handling for pickle protocol 4 (#12857) @jakirkham
  • Migrate as much as possible to pyproject.toml (#12850) @vyasr
  • Enable nbqa pre-commit hooks for isort and black. (#12848) @bdice
  • Setting a threshold for KvikIO IO (#12841) @madsbk
  • Update datasets download URL (#12840) @jjacobelli
  • Make docs builds less verbose (#12836) @AyodeAwe
  • Consolidate linter configs into pyproject.toml (#12834) @vyasr
  • Deprecate names & dtype in Index.copy (#12825) @galipremsagar
  • Deprecate inplace parameters in categorical methods (#12824) @galipremsagar
  • Add optional text file support to ninja-log utility (#12823) @davidwendt
  • Deprecate Index.is_* methods (#12820) @galipremsagar
  • Add dfg as a pre-commit hook (#12819) @vyasr
  • Deprecate datetime_is_numeric from describe (#12818) @galipremsagar
  • Deprecate na_sentinel in factorize (#12817) @galipremsagar
  • Shuffling read into a sub function in parquet read (#12809) @hyperbolic2346
  • Fixing parquet coalescing of reads (#12808) @hyperbolic2346
  • CI: Remove specification of manual stage for check_style.sh script. (#12803) @csadorf
  • Add compute-sanitizer github workflow action to nightly tests (#12800) @davidwendt
  • Enable groupby std and variance aggregation types in libcudf Debug build (#12799) @davidwendt
  • Expose seed argument to hash_values (#12795) @ayushdg
  • Fix groupby gtests coded in namespace cudf::test (#12784) @davidwendt
  • Improve performance for cudf::strings::count_characters for long strings (#12779) @davidwendt
  • Deallocate encoded data in ORC writer immediately after compression (#12770) @vuule
  • Stop force pulling fmt in nvbench. (#12768) @vyasr
  • Remove now redundant cuda initialization (#12758) @vyasr
  • Adds JSON reader, writer io benchmark (#12753) @karthikeyann
  • Use test paths relative to package directory. (#12751) @bdice
  • Add build metrics report as artifact to cpp-build workflow (#12750) @davidwendt
  • Add JNI methods for detecting and purging non-empty nulls from LIST and STRUCT (#12742) @razajafri
  • Stop using versioneer to manage versions (#12741) @vyasr
  • Reduce error handling verbosity in CI tests scripts (#12738) @AjayThorve
  • Reduce the number of test cases in multibyte_split benchmark (#12737) @PointKernel
  • Update shared workflow branches (#12733) @ajschmidt8
  • JNI switches to nested JSON reader (#12732) @res-life
  • Changing cudf::io::source_info to use cudf::host_span<std::byte> in a non-breaking form (#12730) @hyperbolic2346
  • Add nvbench environment class for initializing RMM in benchmarks (#12728) @davidwendt
  • Split C++ and Python build dependencies into separate lists. (#12724) @bdice
  • Add build dependencies to Java tests. (#12723) @bdice
  • Allow setting the seed argument for hash partition (#12715) @firestarman
  • Remove gpuCI scripts. (#12712) @bdice
  • Unpin dask and distributed for development (#12710) @galipremsagar
  • partition_by_hash(): use _split() (#12704) @madsbk
  • Remove DataFrame.quantiles from docs. (#12684) @bdice
  • Fast path for experimental::row::equality (#12676) @divyegala
  • Move date to build string in conda recipe (#12661) @ajschmidt8
  • Refactor reduction logic for fixed-point types (#12652) @davidwendt
  • Pay off some JNI RMM API tech debt (#12632) @revans2
  • Merge copy-on-write feature branch into branch-23.04 (#12619) @galipremsagar
  • Remove cudf::strings::repeat_strings_output_sizes and optional parameter from cudf::strings::repeat_strings (#12609) @davidwendt
  • Pin cuda-nvrtc. (#12606) @bdice
  • Remove cudf::test::print calls from libcudf gtests (#12604) @davidwendt
  • Init JNI version 23.04.0-SNAPSHOT (#12599) @pxLi
  • Add performance benchmarks to user facing docs (#12595) @galipremsagar
  • Add docs build job (#12592) @AyodeAwe
  • Replace message parsing with throwing more specific exceptions (#12426) @vyasr
  • Support conversion to/from cudf in dask.dataframe.core.to_backend (#12380) @rjzamora

- C++
Published by raydouglass over 2 years ago

https://github.com/rapidsai/cudf - [NIGHTLY] v23.10.00

🔗 Links

🚨 Breaking Changes

  • Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
  • Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
  • Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
  • Create table_input_metadata from a table_metadata (#13920) @etseidl
  • Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
  • Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
  • Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
  • Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
  • Remove the libcudf cudf::offset_type type (#13788) @davidwendt
  • Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
  • Update to Cython 3.0.0 (#13777) @vyasr
  • Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
  • Enforce deprecations in 23.10 (#13732) @galipremsagar
  • Upgrade to arrow 12 (#13728) @galipremsagar
  • Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule

🐛 Bug Fixes

  • Fix inaccurate ceil/floor and inaccurate rescaling casts of fixed-point values. (#14242) @bdice
  • Fix inaccuracy in decimal128 rounding. (#14233) @bdice
  • Workaround for illegal instruction error in sm90 for warp intrinsics with mask (#14201) @karthikeyann
  • Fix pytorch related pytest (#14198) @galipremsagar
  • Pin to aws-sdk-cpp<1.11 (#14173) @pentschev
  • Fix assert failure for range window functions (#14168) @mythrocks
  • Fix Memcheck error found in JSON_TEST JsonReaderTest.ErrorStrings (#14164) @karthikeyann
  • Fix calls to copy_bitmask to pass stream parameter (#14158) @davidwendt
  • Fix DataFrame from Series with different CategoricalIndexes (#14157) @mroeschke
  • Pin to numpy<1.25 and numba<0.58 to avoid errors and deprecation warnings-as-errors. (#14156) @bdice
  • Fix kernel launch error for cudf::io::orc::gpu::rowgroup_char_counts_kernel (#14139) @davidwendt
  • Don't sort columns for DataFrame init from list of Series (#14136) @mroeschke
  • Fix DataFrame.values with no columns but index (#14134) @mroeschke
  • Avoid circular cimports in _lib/cpp/reduce.pxd (#14125) @vyasr
  • Add support for nested dict in DataFrame constructor (#14119) @galipremsagar
  • Restrict iterables of DataFrame's as input to DataFrame constructor (#14118) @galipremsagar
  • Allow numeric_only=True for reduction operations on numeric types (#14111) @galipremsagar
  • Preserve name of the column while initializing a DataFrame (#14110) @galipremsagar
  • Correct numerous 20054-D: dynamic initialization errors found on arm+12.2 (#14108) @robertmaynard
  • Drop kwargs from Series.count (#14106) @galipremsagar
  • Fix naming issues with Index.to_frame and MultiIndex.to_frame APIs (#14105) @galipremsagar
  • Only use memory resources that haven't been freed (#14103) @robertmaynard
  • Add support for __round__ in Series and DataFrame (#14099) @galipremsagar
  • Validate ignore_index type in drop_duplicates (#14098) @mroeschke
  • Fix renaming Series and Index (#14080) @galipremsagar
  • Raise NotImplementedError in to_datetime if Z (or tz component) in string (#14074) @mroeschke
  • Raise NotImplementedError for datetime strings with UTC offset (#14070) @mroeschke
  • Update pyarrow-related dispatch logic in dask_cudf (#14069) @rjzamora
  • Use conda mambabuild rather than mamba mambabuild (#14067) @wence-
  • Raise NotImplementedError in to_datetime with dayfirst without infer_format (#14058) @mroeschke
  • Fix various issues in Index.intersection (#14054) @galipremsagar
  • Fix Index.difference to match with pandas (#14053) @galipremsagar
  • Fix empty string column construction (#14052) @galipremsagar
  • Fix IntervalIndex.union to preserve type-metadata (#14051) @galipremsagar
  • Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
  • Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
  • Ignore compile_commands.json (#14048) @harrism
  • Raise TypeError for any non-parseable argument in to_datetime (#14044) @mroeschke
  • Raise NotImplementedError for to_datetime with z format (#14037) @mroeschke
  • Implement sort_remaining for sort_index (#14033) @wence-
  • Raise NotImplementedError for Categoricals with timezones (#14032) @mroeschke
  • Temporary fix Parquet metadata with empty value string being ignored from writing (#14026) @ttnghia
  • Preserve types of scalar being returned when possible in quantile (#14014) @galipremsagar
  • Fix return type of MultiIndex.difference (#14009) @galipremsagar
  • Raise an error when timezone subtypes are encountered in pd.IntervalDtype (#14006) @galipremsagar
  • Fix map column can not be non-nullable for java (#14003) @res-life
  • Fix name selection in Index.difference and Index.intersection (#13986) @galipremsagar
  • Restore column type metadata with dropna to fix factorize API (#13980) @galipremsagar
  • Use thread_index_type to avoid out of bounds accesses in conditional joins (#13971) @vyasr
  • Fix MultiIndex.to_numpy to return numpy array with tuples (#13966) @galipremsagar
  • Use cudf::thread_index_type in get_json_object and tdigest kernels (#13962) @nvdbaranec
  • Fix an issue with IntervalIndex.repr when null values are present (#13958) @galipremsagar
  • Fix type metadata issue preservation with Column.unique (#13957) @galipremsagar
  • Handle Interval scalars when passed in list-like inputs to cudf.Index (#13956) @galipremsagar
  • Fix setting of categories order when dtype is passed to a CategoricalColumn (#13955) @galipremsagar
  • Handle as_index in GroupBy.apply (#13951) @brandon-b-miller
  • Raise error for string types in nsmallest and nlargest (#13946) @galipremsagar
  • Fix index of Groupby.apply results when it is performed on empty objects (#13944) @galipremsagar
  • Fix integer overflow in shim device_sum functions (#13943) @brandon-b-miller
  • Fix type mismatch in groupby reduction for empty objects (#13942) @galipremsagar
  • Fixed processed bytes calculation in APPLY_BOOLEAN_MASK benchmark. (#13937) @Blonck
  • Fix construction of Grouping objects (#13932) @galipremsagar
  • Fix an issue with loc when column names is MultiIndex (#13929) @galipremsagar
  • Fix handling of typecasting in searchsorted (#13925) @galipremsagar
  • Preserve index name in reindex (#13917) @galipremsagar
  • Use cudf::thread_index_type in cuIO to prevent overflow in row indexing (#13910) @vuule
  • Fix for encodings listed in the Parquet column chunk metadata (#13907) @etseidl
  • Use cudf::thread_index_type in concatenate.cu. (#13906) @bdice
  • Use cudf::thread_index_type in replace.cu. (#13905) @bdice
  • Add noSanitizer tag to Java reduction tests failing with sanitizer in CUDA 12 (#13904) @jlowe
  • Remove the internal use of the cudf's default stream in cuIO (#13903) @vuule
  • Use cuda-nvtx-dev CUDA 12 package. (#13901) @bdice
  • Use thread_index_type to avoid index overflow in grid-stride loops (#13895) @PointKernel
  • Fix memory access error in cudf::shift for sliced strings (#13894) @davidwendt
  • Raise error when trying to construct a DataFrame with mixed types (#13889) @galipremsagar
  • Return nan when one variable to be correlated has zero variance in JIT GroupBy Apply (#13884) @brandon-b-miller
  • Correctly detect the BOM mark in read_csv with compressed input (#13881) @vuule
  • Check for the presence of all values in MultiIndex.isin (#13879) @galipremsagar
  • Fix nvtext::generate_character_ngrams performance regression for longer strings (#13874) @davidwendt
  • Fix return type of MultiIndex.levels (#13870) @galipremsagar
  • Fix List's missing children metadata in JSON writer (#13869) @karthikeyann
  • Disable construction of Index when freq is set in pandas-compatibility mode (#13857) @galipremsagar
  • Fix an issue with fetching NA from a TimedeltaColumn (#13853) @galipremsagar
  • Simplify implementation of interval_range() and fix behaviour for floating freq (#13844) @shwina
  • Fix binary operations between Series and Index (#13842) @galipremsagar
  • Update make_lists_column_from_scalar to use make_offsets_child_column utility (#13841) @davidwendt
  • Fix read out of bounds in string concatenate (#13838) @pentschev
  • Raise error for more cases when timezone-aware data is passed to as_column (#13835) @galipremsagar
  • Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
  • Raise error when trying to construct time-zone aware timestamps (#13830) @galipremsagar
  • Fix cuFile I/O factories (#13829) @vuule
  • DataFrame with namedtuples uses ._field as column names (#13824) @mroeschke
  • Branch 23.10 merge 23.08 (#13822) @vyasr
  • Return a Series from JIT GroupBy apply, rather than a DataFrame (#13820) @brandon-b-miller
  • No need to dlsym EnsureS3Finalized we can call it directly (#13819) @robertmaynard
  • Raise error when mixed types are being constructed (#13816) @galipremsagar
  • Fix unbounded sequence issue in DataFrame constructor (#13811) @galipremsagar
  • Fix Byte-Pair-Encoding usage of cuco static-map for storing merge-pairs (#13807) @davidwendt
  • Fix for Parquet writer when requested pages per row is smaller than fragment size (#13806) @etseidl
  • Remove hangs from trying to construct un-bounded sequences (#13799) @galipremsagar
  • Bug/update libcudf to handle arrow12 changes (#13794) @robertmaynard
  • Update get_arrow to arrows 12 CMake target name of arrow::xsimd (#13790) @robertmaynard
  • Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
  • Fix negative unary operation for boolean type (#13780) @galipremsagar
  • Fix __contains__ (in) method for Series (#13779) @galipremsagar
  • Fix binary operation column ordering and missing column issues (#13778) @galipremsagar
  • Cast only time of day to nanos to avoid an overflow in Parquet INT96 write (#13776) @gerashegalov
  • Preserve names of column object in various APIs (#13772) @galipremsagar
  • Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
  • Fix construction of DataFrames from dict when columns are provided (#13766) @wence-
  • Provide our own Cython declaration for make_unique (#13746) @wence-

📖 Documentation

  • Fix typo in docstring: metadata. (#14025) @bdice
  • Fix typo in parquet/page_decode.cuh (#13849) @XinyuZeng
  • Simplify Python doc configuration (#13826) @vyasr
  • Update documentation to reflect recent changes in JSON reader and writer (#13791) @vuule
  • Fix all warnings in Python docs (#13789) @vyasr

🚀 New Features

  • [Java] Add JNI bindings for integers_to_hex (#14205) @razajafri
  • Propagate errors from Parquet reader kernels back to host (#14167) @vuule
  • JNI for HISTOGRAM and MERGE_HISTOGRAM aggregations (#14154) @ttnghia
  • Expose streams in all public sorting APIs (#14146) @vyasr
  • Enable direct ingestion and production of Arrow scalars (#14121) @vyasr
  • Implement GroupBy.value_counts to match pandas API (#14114) @stmio
  • Refactor parquet thrift reader (#14097) @etseidl
  • Refactor hash_reduce_by_row (#14095) @ttnghia
  • Support negative preceding/following for ROW window functions (#14093) @mythrocks
  • Support for progressive parquet chunked reading. (#14079) @nvdbaranec
  • Implement HISTOGRAM and MERGE_HISTOGRAM aggregations (#14045) @ttnghia
  • Expose streams in public search APIs (#14034) @vyasr
  • Expose streams in public replace APIs (#14010) @vyasr
  • Add stream parameter to public cudf::strings::split APIs (#13997) @davidwendt
  • Expose streams in public filling APIs (#13990) @vyasr
  • Expose streams in public concatenate APIs (#13987) @vyasr
  • Use HostMemoryAllocator in jni::allocate_host_buffer (#13975) @gerashegalov
  • Enable fractional null probability for hashing benchmark (#13967) @Blonck
  • Switch pylibcudf-enabled types to use enum class in Cython (#13931) @vyasr
  • Add nvtext::tokenize_with_vocabulary API (#13930) @davidwendt
  • Rewrite DataFrame.stack to support multi level column names (#13927) @isVoid
  • Add HostMemoryAllocator interface (#13924) @gerashegalov
  • Global stream pool (#13922) @etseidl
  • Create table_input_metadata from a table_metadata (#13920) @etseidl
  • Translate column size overflow exception to JNI (#13911) @mythrocks
  • Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
  • Exclude some tests from running with the compute sanitizer (#13872) @firestarman
  • Expand statistics support in ORC writer (#13848) @vuule
  • Register the memory mapped buffer in datasource to improve H2D throughput (#13814) @vuule
  • Add cudf::strings::find function with target per row (#13808) @davidwendt
  • Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
  • Remove unnecessary pointer copying in JIT GroupBy Apply (#13792) @brandon-b-miller
  • Add 'poll' function to custreamz kafka consumer (#13782) @jdye64
  • Support corr in GroupBy.apply through the jit engine (#13767) @shwina
  • Optionally write version 2 page headers in Parquet writer (#13751) @etseidl
  • Support more numeric types in Groupby.apply with engine='jit' (#13729) @brandon-b-miller
  • [FEA] Add DELTA_BINARY_PACKED decoding support to Parquet reader (#13637) @etseidl
  • Read FIXED_LEN_BYTE_ARRAY as binary in parquet reader (#13437) @PointKernel

🛠️ Improvements

  • Update shared-action-workflows references (backport from 23.12 to 23.10) (#14300) @AyodeAwe
  • Pin dask and distributed for 23.10 release (#14225) @galipremsagar
  • update rmm tag path (#14195) @AyodeAwe
  • Disable Recently Updated Check (#14193) @ajschmidt8
  • Move cpp/src/hash/hash_allocator.cuh to include/cudf/hashing/detail (#14163) @davidwendt
  • Add Parquet reader benchmarks for row selection (#14147) @vuule
  • Update image names (#14145) @AyodeAwe
  • Support callables in DataFrame.assign (#14142) @wence-
  • Reduce memory usage of as_categorical_column (#14138) @wence-
  • Replace Python scalar conversions with libcudf (#14124) @vyasr
  • Update to clang 16.0.6. (#14120) @bdice
  • Fix type of empty Index and raise warning in Series constructor (#14116) @galipremsagar
  • Add stream parameter to external dict APIs (#14115) @SurajAralihalli
  • Add fallback matrix for nvcomp. (#14082) @bdice
  • [Java] Add recoverWithNull to JSONOptions and pass to Table.readJSON (#14078) @andygrove
  • Remove header tests (#14072) @ajschmidt8
  • Refactor contains_table with cuco::static_set (#14064) @PointKernel
  • Remove debug print in a Parquet test (#14063) @vuule
  • Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
  • Expose stream parameter in public strings find APIs (#14060) @davidwendt
  • Update doxygen to 1.9.1 (#14059) @vyasr
  • Remove the mr from the base fixture (#14057) @vyasr
  • Expose streams in public strings case APIs (#14056) @davidwendt
  • Refactor libcudf indexalator to typed normalator (#14043) @davidwendt
  • Use cudf::make_empty_column instead of column_view constructor (#14030) @davidwendt
  • Remove quadratic runtime due to accessing Frame._dtypes in loop (#14028) @wence-
  • Explicitly depend on zlib in conda recipes (#14018) @wence-
  • Use grid_stride for stride computations. (#13996) @bdice
  • Fix an issue where casting null-array to object dtype will result in a failure (#13994) @galipremsagar
  • Add tab as literal to cudf::test::to_string output (#13993) @davidwendt
  • Enable codes dtype parity in pandas-compatibility mode for factorize API (#13982) @galipremsagar
  • Fix CategoricalIndex ordering in Groupby.agg when pandas-compatibility mode is enabled (#13978) @galipremsagar
  • Produce a fatal error if cudf is unable to find pyarrow include directory (#13976) @cwharris
  • Use thread_index_type in partitioning.cu (#13973) @divyegala
  • Use cudf::thread_index_type in merge.cu (#13972) @divyegala
  • Use copy-pr-bot (#13970) @ajschmidt8
  • Use cudf::thread_index_type in strings custom kernels (#13968) @davidwendt
  • Add bytes_per_second to hash_partition benchmark (#13965) @Blonck
  • Added pinned pool reservation API for java (#13964) @revans2
  • Simplify wheel build scripts and allow alphas of RAPIDS dependencies (#13963) @vyasr
  • Add bytes_per_second to copy_if_else benchmark (#13960) @Blonck
  • Add pandas compatible output to Series.unique (#13959) @galipremsagar
  • Add bytes_per_second to compiled binaryop benchmark (#13938) @Blonck
  • Unpin dask and distributed for 23.10 development (#13935) @galipremsagar
  • Make HostColumnVector.getRefCount public (#13934) @abellina
  • Use cuco::static_set in JSON tree algorithm (#13928) @karthikeyann
  • Add java API to get size of host memory needed to copy column view (#13919) @revans2
  • Use cudf::size_type instead of int32 where appropriate in nvtext functions (#13915) @davidwendt
  • Enable hugepage for arrow host allocations (#13914) @madsbk
  • Improve performance of nvtext::edit_distance (#13912) @davidwendt
  • Ensure cudf internals use pylibcudf in pure Python mode (#13909) @vyasr
  • Use empty() instead of size() where possible (#13908) @vuule
  • [JNI] Adds HostColumnVector.EventHandler for spillability checks (#13898) @abellina
  • Return Timestamp & Timedelta for fetching scalars in DatetimeIndex & TimedeltaIndex (#13896) @galipremsagar
  • Allow explicit shuffle="p2p" within dask-cudf API (#13893) @rjzamora
  • Disable creation of DatetimeIndex when freq is passed to cudf.date_range (#13890) @galipremsagar
  • Bring parity with pandas for datetime & timedelta comparison operations (#13877) @galipremsagar
  • Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
  • Raise error when astype(object) is called in pandas compatibility mode (#13862) @galipremsagar
  • Fixes a performance regression in FST (#13850) @elstehle
  • Set native handles to null on close in Java wrapper classes (#13818) @jlowe
  • Avoid use of CUDF_EXPECTS in libcudf unit tests outside of helper functions with return values (#13812) @vuule
  • Update lists::contains to experimental row comparator (#13810) @divyegala
  • Reduce lists::contains dispatches for scalars (#13805) @divyegala
  • Long string optimization for string column parsing in JSON reader (#13803) @karthikeyann
  • Raise NotImplementedError for pd.SparseDtype (#13798) @mroeschke
  • Remove the libcudf cudf::offset_type type (#13788) @davidwendt
  • Move Spark-independent Table debug to cudf Java (#13783) @gerashegalov
  • Update to Cython 3.0.0 (#13777) @vyasr
  • Refactor Parquet reader handling of V2 page header info (#13775) @etseidl
  • Branch 23.10 merge 23.08 (#13773) @vyasr
  • Restructure JSON code to correctly reflect legacy/experimental status (#13757) @vuule
  • Branch 23.10 merge 23.08 (#13753) @vyasr
  • Enforce deprecations in 23.10 (#13732) @galipremsagar
  • Upgrade to arrow 12 (#13728) @galipremsagar
  • Refactors JSON reader's pushdown automaton (#13716) @elstehle
  • Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule

- C++
Published by rapids-bot[bot] over 2 years ago

https://github.com/rapidsai/cudf - v23.10.00

🚨 Breaking Changes

  • Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
  • Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
  • Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
  • Create table_input_metadata from a table_metadata (#13920) @etseidl
  • Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
  • Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
  • Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
  • Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
  • Remove the libcudf cudf::offset_type type (#13788) @davidwendt
  • Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
  • Update to Cython 3.0.0 (#13777) @vyasr
  • Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
  • Enforce deprecations in 23.10 (#13732) @galipremsagar
  • Upgrade to arrow 12 (#13728) @galipremsagar
  • Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule

🐛 Bug Fixes

  • Fix inaccurate ceil/floor and inaccurate rescaling casts of fixed-point values. (#14242) @bdice
  • Fix inaccuracy in decimal128 rounding. (#14233) @bdice
  • Workaround for illegal instruction error in sm90 for warp intrinsics with mask (#14201) @karthikeyann
  • Fix pytorch related pytest (#14198) @galipremsagar
  • Pin to aws-sdk-cpp<1.11 (#14173) @pentschev
  • Fix assert failure for range window functions (#14168) @mythrocks
  • Fix Memcheck error found in JSON_TEST JsonReaderTest.ErrorStrings (#14164) @karthikeyann
  • Fix calls to copy_bitmask to pass stream parameter (#14158) @davidwendt
  • Fix DataFrame from Series with different CategoricalIndexes (#14157) @mroeschke
  • Pin to numpy<1.25 and numba<0.58 to avoid errors and deprecation warnings-as-errors. (#14156) @bdice
  • Fix kernel launch error for cudf::io::orc::gpu::rowgroup_char_counts_kernel (#14139) @davidwendt
  • Don't sort columns for DataFrame init from list of Series (#14136) @mroeschke
  • Fix DataFrame.values with no columns but index (#14134) @mroeschke
  • Avoid circular cimports in _lib/cpp/reduce.pxd (#14125) @vyasr
  • Add support for nested dict in DataFrame constructor (#14119) @galipremsagar
  • Restrict iterables of DataFrame's as input to DataFrame constructor (#14118) @galipremsagar
  • Allow numeric_only=True for reduction operations on numeric types (#14111) @galipremsagar
  • Preserve name of the column while initializing a DataFrame (#14110) @galipremsagar
  • Correct numerous 20054-D: dynamic initialization errors found on arm+12.2 (#14108) @robertmaynard
  • Drop kwargs from Series.count (#14106) @galipremsagar
  • Fix naming issues with Index.to_frame and MultiIndex.to_frame APIs (#14105) @galipremsagar
  • Only use memory resources that haven't been freed (#14103) @robertmaynard
  • Add support for __round__ in Series and DataFrame (#14099) @galipremsagar
  • Validate ignore_index type in drop_duplicates (#14098) @mroeschke
  • Fix renaming Series and Index (#14080) @galipremsagar
  • Raise NotImplementedError in to_datetime if Z (or tz component) in string (#14074) @mroeschke
  • Raise NotImplementedError for datetime strings with UTC offset (#14070) @mroeschke
  • Update pyarrow-related dispatch logic in dask_cudf (#14069) @rjzamora
  • Use conda mambabuild rather than mamba mambabuild (#14067) @wence-
  • Raise NotImplementedError in to_datetime with dayfirst without infer_format (#14058) @mroeschke
  • Fix various issues in Index.intersection (#14054) @galipremsagar
  • Fix Index.difference to match with pandas (#14053) @galipremsagar
  • Fix empty string column construction (#14052) @galipremsagar
  • Fix IntervalIndex.union to preserve type-metadata (#14051) @galipremsagar
  • Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
  • Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
  • Ignore compile_commands.json (#14048) @harrism
  • Raise TypeError for any non-parseable argument in to_datetime (#14044) @mroeschke
  • Raise NotImplementedError for to_datetime with z format (#14037) @mroeschke
  • Implement sort_remaining for sort_index (#14033) @wence-
  • Raise NotImplementedError for Categoricals with timezones (#14032) @mroeschke
  • Temporary fix Parquet metadata with empty value string being ignored from writing (#14026) @ttnghia
  • Preserve types of scalar being returned when possible in quantile (#14014) @galipremsagar
  • Fix return type of MultiIndex.difference (#14009) @galipremsagar
  • Raise an error when timezone subtypes are encountered in pd.IntervalDtype (#14006) @galipremsagar
  • Fix map column can not be non-nullable for java (#14003) @res-life
  • Fix name selection in Index.difference and Index.intersection (#13986) @galipremsagar
  • Restore column type metadata with dropna to fix factorize API (#13980) @galipremsagar
  • Use thread_index_type to avoid out of bounds accesses in conditional joins (#13971) @vyasr
  • Fix MultiIndex.to_numpy to return numpy array with tuples (#13966) @galipremsagar
  • Use cudf::thread_index_type in get_json_object and tdigest kernels (#13962) @nvdbaranec
  • Fix an issue with IntervalIndex.repr when null values are present (#13958) @galipremsagar
  • Fix type metadata issue preservation with Column.unique (#13957) @galipremsagar
  • Handle Interval scalars when passed in list-like inputs to cudf.Index (#13956) @galipremsagar
  • Fix setting of categories order when dtype is passed to a CategoricalColumn (#13955) @galipremsagar
  • Handle as_index in GroupBy.apply (#13951) @brandon-b-miller
  • Raise error for string types in nsmallest and nlargest (#13946) @galipremsagar
  • Fix index of Groupby.apply results when it is performed on empty objects (#13944) @galipremsagar
  • Fix integer overflow in shim device_sum functions (#13943) @brandon-b-miller
  • Fix type mismatch in groupby reduction for empty objects (#13942) @galipremsagar
  • Fixed processed bytes calculation in APPLY_BOOLEAN_MASK benchmark. (#13937) @Blonck
  • Fix construction of Grouping objects (#13932) @galipremsagar
  • Fix an issue with loc when column names is MultiIndex (#13929) @galipremsagar
  • Fix handling of typecasting in searchsorted (#13925) @galipremsagar
  • Preserve index name in reindex (#13917) @galipremsagar
  • Use cudf::thread_index_type in cuIO to prevent overflow in row indexing (#13910) @vuule
  • Fix for encodings listed in the Parquet column chunk metadata (#13907) @etseidl
  • Use cudf::thread_index_type in concatenate.cu. (#13906) @bdice
  • Use cudf::thread_index_type in replace.cu. (#13905) @bdice
  • Add noSanitizer tag to Java reduction tests failing with sanitizer in CUDA 12 (#13904) @jlowe
  • Remove the internal use of the cudf's default stream in cuIO (#13903) @vuule
  • Use cuda-nvtx-dev CUDA 12 package. (#13901) @bdice
  • Use thread_index_type to avoid index overflow in grid-stride loops (#13895) @PointKernel
  • Fix memory access error in cudf::shift for sliced strings (#13894) @davidwendt
  • Raise error when trying to construct a DataFrame with mixed types (#13889) @galipremsagar
  • Return nan when one variable to be correlated has zero variance in JIT GroupBy Apply (#13884) @brandon-b-miller
  • Correctly detect the BOM mark in read_csv with compressed input (#13881) @vuule
  • Check for the presence of all values in MultiIndex.isin (#13879) @galipremsagar
  • Fix nvtext::generate_character_ngrams performance regression for longer strings (#13874) @davidwendt
  • Fix return type of MultiIndex.levels (#13870) @galipremsagar
  • Fix List's missing children metadata in JSON writer (#13869) @karthikeyann
  • Disable construction of Index when freq is set in pandas-compatibility mode (#13857) @galipremsagar
  • Fix an issue with fetching NA from a TimedeltaColumn (#13853) @galipremsagar
  • Simplify implementation of interval_range() and fix behaviour for floating freq (#13844) @shwina
  • Fix binary operations between Series and Index (#13842) @galipremsagar
  • Update make_lists_column_from_scalar to use make_offsets_child_column utility (#13841) @davidwendt
  • Fix read out of bounds in string concatenate (#13838) @pentschev
  • Raise error for more cases when timezone-aware data is passed to as_column (#13835) @galipremsagar
  • Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
  • Raise error when trying to construct time-zone aware timestamps (#13830) @galipremsagar
  • Fix cuFile I/O factories (#13829) @vuule
  • DataFrame with namedtuples uses ._field as column names (#13824) @mroeschke
  • Branch 23.10 merge 23.08 (#13822) @vyasr
  • Return a Series from JIT GroupBy apply, rather than a DataFrame (#13820) @brandon-b-miller
  • No need to dlsym EnsureS3Finalized we can call it directly (#13819) @robertmaynard
  • Raise error when mixed types are being constructed (#13816) @galipremsagar
  • Fix unbounded sequence issue in DataFrame constructor (#13811) @galipremsagar
  • Fix Byte-Pair-Encoding usage of cuco static-map for storing merge-pairs (#13807) @davidwendt
  • Fix for Parquet writer when requested pages per row is smaller than fragment size (#13806) @etseidl
  • Remove hangs from trying to construct un-bounded sequences (#13799) @galipremsagar
  • Bug/update libcudf to handle arrow12 changes (#13794) @robertmaynard
  • Update get_arrow to arrows 12 CMake target name of arrow::xsimd (#13790) @robertmaynard
  • Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
  • Fix negative unary operation for boolean type (#13780) @galipremsagar
  • Fix contains(in) method for Series (#13779) @galipremsagar
  • Fix binary operation column ordering and missing column issues (#13778) @galipremsagar
  • Cast only time of day to nanos to avoid an overflow in Parquet INT96 write (#13776) @gerashegalov
  • Preserve names of column object in various APIs (#13772) @galipremsagar
  • Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
  • Fix construction of DataFrames from dict when columns are provided (#13766) @wence-
  • Provide our own Cython declaration for make_unique (#13746) @wence-

📖 Documentation

  • Fix typo in docstring: metadata. (#14025) @bdice
  • Fix typo in parquet/page_decode.cuh (#13849) @XinyuZeng
  • Simplify Python doc configuration (#13826) @vyasr
  • Update documentation to reflect recent changes in JSON reader and writer (#13791) @vuule
  • Fix all warnings in Python docs (#13789) @vyasr

🚀 New Features

  • [Java] Add JNI bindings for integers_to_hex (#14205) @razajafri
  • Propagate errors from Parquet reader kernels back to host (#14167) @vuule
  • JNI for HISTOGRAM and MERGE_HISTOGRAM aggregations (#14154) @ttnghia
  • Expose streams in all public sorting APIs (#14146) @vyasr
  • Enable direct ingestion and production of Arrow scalars (#14121) @vyasr
  • Implement GroupBy.value_counts to match pandas API (#14114) @stmio
  • Refactor parquet thrift reader (#14097) @etseidl
  • Refactor hash_reduce_by_row (#14095) @ttnghia
  • Support negative preceding/following for ROW window functions (#14093) @mythrocks
  • Support for progressive parquet chunked reading. (#14079) @nvdbaranec
  • Implement HISTOGRAM and MERGE_HISTOGRAM aggregations (#14045) @ttnghia
  • Expose streams in public search APIs (#14034) @vyasr
  • Expose streams in public replace APIs (#14010) @vyasr
  • Add stream parameter to public cudf::strings::split APIs (#13997) @davidwendt
  • Expose streams in public filling APIs (#13990) @vyasr
  • Expose streams in public concatenate APIs (#13987) @vyasr
  • Use HostMemoryAllocator in jni::allocate_host_buffer (#13975) @gerashegalov
  • Enable fractional null probability for hashing benchmark (#13967) @Blonck
  • Switch pylibcudf-enabled types to use enum class in Cython (#13931) @vyasr
  • Add nvtext::tokenize_with_vocabulary API (#13930) @davidwendt
  • Rewrite DataFrame.stack to support multi level column names (#13927) @isVoid
  • Add HostMemoryAllocator interface (#13924) @gerashegalov
  • Global stream pool (#13922) @etseidl
  • Create table_input_metadata from a table_metadata (#13920) @etseidl
  • Translate column size overflow exception to JNI (#13911) @mythrocks
  • Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
  • Exclude some tests from running with the compute sanitizer (#13872) @firestarman
  • Expand statistics support in ORC writer (#13848) @vuule
  • Register the memory mapped buffer in datasource to improve H2D throughput (#13814) @vuule
  • Add cudf::strings::find function with target per row (#13808) @davidwendt
  • Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
  • Remove unnecessary pointer copying in JIT GroupBy Apply (#13792) @brandon-b-miller
  • Add 'poll' function to custreamz kafka consumer (#13782) @jdye64
  • Support corr in GroupBy.apply through the jit engine (#13767) @shwina
  • Optionally write version 2 page headers in Parquet writer (#13751) @etseidl
  • Support more numeric types in Groupby.apply with engine='jit' (#13729) @brandon-b-miller
  • [FEA] Add DELTA_BINARY_PACKED decoding support to Parquet reader (#13637) @etseidl
  • Read FIXED_LEN_BYTE_ARRAY as binary in parquet reader (#13437) @PointKernel

🛠️ Improvements

  • Pin dask and distributed for 23.10 release (#14225) @galipremsagar
  • update rmm tag path (#14195) @AyodeAwe
  • Disable Recently Updated Check (#14193) @ajschmidt8
  • Move cpp/src/hash/hash_allocator.cuh to include/cudf/hashing/detail (#14163) @davidwendt
  • Add Parquet reader benchmarks for row selection (#14147) @vuule
  • Update image names (#14145) @AyodeAwe
  • Support callables in DataFrame.assign (#14142) @wence-
  • Reduce memory usage of as_categorical_column (#14138) @wence-
  • Replace Python scalar conversions with libcudf (#14124) @vyasr
  • Update to clang 16.0.6. (#14120) @bdice
  • Fix type of empty Index and raise warning in Series constructor (#14116) @galipremsagar
  • Add stream parameter to external dict APIs (#14115) @SurajAralihalli
  • Add fallback matrix for nvcomp. (#14082) @bdice
  • [Java] Add recoverWithNull to JSONOptions and pass to Table.readJSON (#14078) @andygrove
  • Remove header tests (#14072) @ajschmidt8
  • Refactor contains_table with cuco::static_set (#14064) @PointKernel
  • Remove debug print in a Parquet test (#14063) @vuule
  • Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
  • Expose stream parameter in public strings find APIs (#14060) @davidwendt
  • Update doxygen to 1.9.1 (#14059) @vyasr
  • Remove the mr from the base fixture (#14057) @vyasr
  • Expose streams in public strings case APIs (#14056) @davidwendt
  • Refactor libcudf indexalator to typed normalator (#14043) @davidwendt
  • Use cudf::make_empty_column instead of column_view constructor (#14030) @davidwendt
  • Remove quadratic runtime due to accessing Frame._dtypes in loop (#14028) @wence-
  • Explicitly depend on zlib in conda recipes (#14018) @wence-
  • Use grid_stride for stride computations. (#13996) @bdice
  • Fix an issue where casting null-array to object dtype will result in a failure (#13994) @galipremsagar
  • Add tab as literal to cudf::test::to_string output (#13993) @davidwendt
  • Enable codes dtype parity in pandas-compatibility mode for factorize API (#13982) @galipremsagar
  • Fix CategoricalIndex ordering in Groupby.agg when pandas-compatibility mode is enabled (#13978) @galipremsagar
  • Produce a fatal error if cudf is unable to find pyarrow include directory (#13976) @cwharris
  • Use thread_index_type in partitioning.cu (#13973) @divyegala
  • Use cudf::thread_index_type in merge.cu (#13972) @divyegala
  • Use copy-pr-bot (#13970) @ajschmidt8
  • Use cudf::thread_index_type in strings custom kernels (#13968) @davidwendt
  • Add bytes_per_second to hash_partition benchmark (#13965) @Blonck
  • Added pinned pool reservation API for java (#13964) @revans2
  • Simplify wheel build scripts and allow alphas of RAPIDS dependencies (#13963) @vyasr
  • Add bytes_per_second to copyifelse benchmark (#13960) @Blonck
  • Add pandas compatible output to Series.unique (#13959) @galipremsagar
  • Add bytes_per_second to compiled binaryop benchmark (#13938) @Blonck
  • Unpin dask and distributed for 23.10 development (#13935) @galipremsagar
  • Make HostColumnVector.getRefCount public (#13934) @abellina
  • Use cuco::static_set in JSON tree algorithm (#13928) @karthikeyann
  • Add java API to get size of host memory needed to copy column view (#13919) @revans2
  • Use cudf::size_type instead of int32 where appropriate in nvtext functions (#13915) @davidwendt
  • Enable hugepage for arrow host allocations (#13914) @madsbk
  • Improve performance of nvtext::edit_distance (#13912) @davidwendt
  • Ensure cudf internals use pylibcudf in pure Python mode (#13909) @vyasr
  • Use empty() instead of size() where possible (#13908) @vuule
  • [JNI] Adds HostColumnVector.EventHandler for spillability checks (#13898) @abellina
  • Return Timestamp & Timedelta for fetching scalars in DatetimeIndex & TimedeltaIndex (#13896) @galipremsagar
  • Allow explicit shuffle="p2p" within dask-cudf API (#13893) @rjzamora
  • Disable creation of DatetimeIndex when freq is passed to cudf.date_range (#13890) @galipremsagar
  • Bring parity with pandas for datetime & timedelta comparison operations (#13877) @galipremsagar
  • Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
  • Raise error when astype(object) is called in pandas compatibility mode (#13862) @galipremsagar
  • Fixes a performance regression in FST (#13850) @elstehle
  • Set native handles to null on close in Java wrapper classes (#13818) @jlowe
  • Avoid use of CUDF_EXPECTS in libcudf unit tests outside of helper functions with return values (#13812) @vuule
  • Update lists::contains to experimental row comparator (#13810) @divyegala
  • Reduce lists::contains dispatches for scalars (#13805) @divyegala
  • Long string optimization for string column parsing in JSON reader (#13803) @karthikeyann
  • Raise NotImplementedError for pd.SparseDtype (#13798) @mroeschke
  • Remove the libcudf cudf::offset_type type (#13788) @davidwendt
  • Move Spark-independent Table debug to cudf Java (#13783) @gerashegalov
  • Update to Cython 3.0.0 (#13777) @vyasr
  • Refactor Parquet reader handling of V2 page header info (#13775) @etseidl
  • Branch 23.10 merge 23.08 (#13773) @vyasr
  • Restructure JSON code to correctly reflect legacy/experimental status (#13757) @vuule
  • Branch 23.10 merge 23.08 (#13753) @vyasr
  • Enforce deprecations in 23.10 (#13732) @galipremsagar
  • Upgrade to arrow 12 (#13728) @galipremsagar
  • Refactors JSON reader's pushdown automaton (#13716) @elstehle
  • Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule

- C++
Published by raydouglass over 2 years ago

https://github.com/rapidsai/cudf - [NIGHTLY] v23.12.00

🔗 Links

🚨 Breaking Changes

  • Expose stream parameter in public strings convert APIs (#14255) @davidwendt

🐛 Bug Fixes

  • Downgrade to Arrow 12.0.0 for aws-sdk-cpp and fix cudf_kafka builds for new CI containers (#14296) @vyasr
  • Fix memset error in nvtext::edit_distance_matrix (#14283) @davidwendt
  • Changes JSON reader's recovery option's behaviour to ignore all characters after a valid JSON record (#14279) @elstehle
  • Handle empty string correctly in Parquet statistics (#14257) @etseidl
  • Fixes behaviour for incomplete lines when recover_with_nulls is enabled (#14252) @elstehle
  • cudf::detail::pinned_allocator doesn't throw from deallocate (#14251) @robertmaynard
  • Fix strings replace for adjacent, identical multi-byte UTF-8 character targets (#14235) @davidwendt
  • Fixing parquet list of struct interpretation (#13715) @hyperbolic2346

🚀 New Features

  • Update rapids-cmake functions to non-deprecated signatures (#14265) @robertmaynard
  • Expose streams in public null mask APIs (#14263) @vyasr
  • Expose streams in binaryop APIs (#14187) @vyasr
  • Add pylibcudf.Scalar that interoperates with Arrow scalars (#14133) @vyasr
  • Add DELTA_BINARY_PACKED encoder for Parquet writer (#14100) @etseidl

🛠️ Improvements

  • Forward-merge branch-23.10 to branch-23.12 (#14307) @bdice
  • Update shared-action-workflows references (#14289) @AyodeAwe
  • Add stream parameter to List Sort and Filter APIs (#14272) @SurajAralihalli
  • Use branch-23.12 workflows. (#14271) @bdice
  • Refactor LogicalType for Parquet (#14264) @etseidl
  • Centralize chunked reading code in the parquet reader to reader_impl_chunking.cu (#14262) @nvdbaranec
  • Expose stream parameter in public strings replace APIs (#14261) @davidwendt
  • Expose stream parameter in public strings APIs (#14260) @davidwendt
  • Cleanup of namespaces in parquet code. (#14259) @nvdbaranec
  • Make parquet schema index type consistent (#14256) @hyperbolic2346
  • Expose stream parameter in public strings convert APIs (#14255) @davidwendt
  • Add in java bindings for DataSource (#14254) @revans2
  • Add stream parameter to List Manipulation and Operations APIs (#14248) @SurajAralihalli
  • Expose stream parameter in public strings split/partition APIs (#14247) @davidwendt
  • Improve contains_column by invoking contains_table (#14238) @PointKernel
  • Detect and report errors in Parquet header parsing (#14237) @etseidl
  • Forward merge 23.10 into 23.12 (#14231) @galipremsagar
  • Return error if BOOL8 column-type is used with integers-to-hex (#14208) @davidwendt
  • Enable indexalator for device code (#14206) @davidwendt
  • Marginally reduce memory footprint of joins (#14197) @wence-
  • Add nvtx annotations to spilling-based data movement (#14196) @wence-
  • Remove the use of volatile in ORC (#14175) @vuule
  • Add bytes_per_second to distinct_count of stream_compaction nvbench. (#14172) @Blonck
  • Add bytes_per_second to transpose benchmark (#14170) @Blonck
  • cuDF: Build CUDA 12.0 ARM conda packages. (#14112) @bdice
  • Add bytes_per_second to shift benchmark (#13950) @Blonck

- C++
Published by rapids-bot[bot] over 2 years ago

https://github.com/rapidsai/cudf - v23.08.00

🚨 Breaking Changes

  • Enforce deprecations and add clarifications around existing deprecations (#13710) @galipremsagar
  • Separate MurmurHash32 from hash_functions.cuh (#13681) @davidwendt
  • Avoid storing metadata in pointers in ORC and Parquet writers (#13648) @vuule
  • Expose streams in all public copying APIs (#13629) @vyasr
  • Remove deprecated cudf::strings::slice_strings (by delimiter) functions (#13628) @davidwendt
  • Remove deprecated cudf.set_allocator. (#13591) @bdice
  • Change build.sh to use pip install instead of setup.py (#13507) @vyasr
  • Remove unused max_rows_tensor parameter from subword tokenizer (#13463) @davidwendt
  • Fix decimal scale reductions in _get_decimal_type (#13224) @charlesbluca

🐛 Bug Fixes

  • Add CUDA version to cudf_kafka and libcudf-example build strings. (#13769) @bdice
  • Fix typo in wheels-test.yaml. (#13763) @bdice
  • Don't test strings shorter than the requested ngram size (#13758) @vyasr
  • Add CUDA version to custreamz build string. (#13754) @bdice
  • Fix writing of ORC files with empty child string columns (#13745) @vuule
  • Remove the erroneous "empty level" short-circuit from ORC reader (#13722) @vuule
  • Fix character counting when writing sliced tables into ORC (#13721) @vuule
  • Parquet uses row group row count if missing from header (#13712) @hyperbolic2346
  • Fix reading of RLE encoded boolean data from parquet files with V2 page headers (#13707) @etseidl
  • Fix a corner case of list lexicographic comparator (#13701) @ttnghia
  • Fix combined filtering and column projection in dask_cudf.read_parquet (#13697) @rjzamora
  • Revert fetch-rapids changes (#13696) @vyasr
  • Data generator - include offsets in the size estimate of list elements (#13688) @vuule
  • Add cuda-nvcc-impl to cudf for numba CUDA 12 (#13673) @jakirkham
  • Fix combined filtering and column projection in read_parquet (#13666) @rjzamora
  • Use thrust::identity as hash functions for byte pair encoding (#13665) @PointKernel
  • Fix loc-getitem ordering when index contains duplicate labels (#13659) @wence-
  • [REVIEW] Introduce parity with pandas for MultiIndex.loc ordering & fix a bug in Groupby with as_index (#13657) @galipremsagar
  • Fix memcheck error found in nvtext tokenize functions (#13649) @davidwendt
  • Fix has_nonempty_nulls ignoring column offset (#13647) @ttnghia
  • [Java] Avoid double-free corruption in case of an Exception while creating a ColumnView (#13645) @razajafri
  • Fix memcheck error in ORC reader call to cudf::io::copy_uncompressed_kernel (#13643) @davidwendt
  • Fix CUDA 12 conda environment to remove cubinlinker and ptxcompiler. (#13636) @bdice
  • Fix inf/NaN comparisons for FLOAT orderby in window functions (#13635) @mythrocks
  • Refactor Index search to simplify code and increase correctness (#13625) @wence-
  • Fix compile warning for unused variable in split_re.cu (#13621) @davidwendt
  • Fix tz_localize for dask_cudf Series (#13610) @shwina
  • Fix issue with no decompressed data in ORC reader (#13609) @vuule
  • Fix floating point window range extents. (#13606) @mythrocks
  • Fix localize(None) for timezone-naive columns (#13603) @shwina
  • Fixed a memory leak caused by Exception thrown while constructing a ColumnView (#13597) @razajafri
  • Handle nullptr return value from bitmask_or in distinct_count (#13590) @wence-
  • Bring parity with pandas in Index.join (#13589) @galipremsagar
  • Fix cudf.melt when there are more than 255 columns (#13588) @hcho3
  • Fix memory issues in cuIO due to removal of memory padding (#13586) @ttnghia
  • Fix Parquet multi-file reading (#13584) @etseidl
  • Fix memcheck error found in LISTS_TEST (#13579) @davidwendt
  • Fix memcheck error found in STRINGS_TEST (#13578) @davidwendt
  • Fix memcheck error found in INTEROP_TEST (#13577) @davidwendt
  • Fix memcheck errors found in REDUCTION_TEST (#13574) @davidwendt
  • Preemptive fix for hive-partitioning change in dask (#13564) @rjzamora
  • Fix an issue with dask_cudf.read_csv when lines are needed to be skipped (#13555) @galipremsagar
  • Fix out-of-bounds memory write in cudf::dictionary::detail::concatenate (#13554) @davidwendt
  • Fix the null mask size in json reader (#13537) @karthikeyann
  • Fix cudf::strings::strip for all-empty input column (#13533) @davidwendt
  • Make sure to build without isolation or installing dependencies (#13524) @vyasr
  • Remove preload lib from CMake for now (#13519) @vyasr
  • Fix missing separator after null values in JSON writer (#13503) @karthikeyann
  • Ensure single_lane_block_sum_reduce is safe to call in a loop (#13488) @wence-
  • Update all versions in pyproject.toml files. (#13486) @bdice
  • Remove applying nvbench that doesn't exist in 23.08 (#13484) @robertmaynard
  • Fix chunked Parquet reader benchmark (#13482) @vuule
  • Update JNI JSON reader column compatibility for Spark (#13477) @revans2
  • Fix unsanitized output of scan with strings (#13455) @davidwendt
  • Reject functions without bytecode from _can_be_jitted in GroupBy Apply (#13429) @brandon-b-miller
  • Fix decimal scale reductions in _get_decimal_type (#13224) @charlesbluca

📖 Documentation

  • Fix doxygen groups for io data sources and sinks (#13718) @davidwendt
  • Add pandas compatibility note to DataFrame.query docstring (#13693) @beckernick
  • Add pylibcudf to developer guide (#13639) @vyasr
  • Fix repeated words in doxygen text (#13598) @karthikeyann
  • Update docs for top-level API. (#13592) @bdice
  • Fix the doxygen text for cudf::concatenate and other places (#13561) @davidwendt
  • Document stream validation approach used in testing (#13556) @vyasr
  • Cleanup doc repetitions in libcudf (#13470) @karthikeyann

🚀 New Features

  • Support min and max aggregations for list type in groupby and reduction (#13676) @ttnghia
  • Add nvtext::jaccard_index API for strings columns (#13669) @davidwendt
  • Add read_parquet_metadata libcudf API (#13663) @karthikeyann
  • Expose streams in all public copying APIs (#13629) @vyasr
  • Add XXHash_64 hash function to cudf (#13612) @davidwendt
  • Java support: Floating point order-by columns for RANGE window functions (#13595) @mythrocks
  • Use cuco::static_map to build string dictionaries in ORC writer (#13580) @vuule
  • Add pylibcudf subpackage with gather implementation (#13562) @vyasr
  • Add JNI for lists::concatenate_list_elements (#13547) @ttnghia
  • Enable nested types for lists::concatenate_list_elements (#13545) @ttnghia
  • Add unicode encoding for string columns in JSON writer (#13539) @karthikeyann
  • Remove numba kernels from find_index_of_val (#13517) @brandon-b-miller
  • Floating point order-by columns for RANGE window functions (#13512) @mythrocks
  • Parse column chunk metadata statistics in parquet reader (#13472) @karthikeyann
  • Add abs function to apply (#13408) @brandon-b-miller
  • [FEA] AST filtering in parquet reader (#13348) @karthikeyann
  • [FEA] Adds option to recover from invalid JSON lines in JSON tokenizer (#13344) @elstehle
  • Ensure cccl packages don't clash with upstream version (#13235) @robertmaynard
  • Update struct_minmax_util to experimental row comparator (#13069) @divyegala
  • Add stream parameter to hashing APIs (#12090) @vyasr

🛠️ Improvements

  • Pin dask and distributed for 23.08 release (#13802) @galipremsagar
  • Relax protobuf pinnings. (#13770) @bdice
  • Switch fully unbounded window functions to use aggregations (#13727) @mythrocks
  • Switch to new wheel building pipeline (#13723) @vyasr
  • Revert CUDA 12.0 CI workflows to branch-23.08. (#13719) @bdice
  • Adding identify minimum version requirement (#13713) @hyperbolic2346
  • Enforce deprecations and add clarifications around existing deprecations (#13710) @galipremsagar
  • Optimize ORC reader performance for list data (#13708) @vyasr
  • fix limit overflow message in a docstring (#13703) @ahmet-uyar
  • Alleviates JSON parser's need for multi-file sources to end with a newline (#13702) @elstehle
  • Update cython-lint and replace flake8 with ruff (#13699) @vyasr
  • Add __dask_tokenize__ definitions to cudf classes (#13695) @rjzamora
  • Convert libcudf hashing benchmarks to nvbench (#13694) @davidwendt
  • Separate MurmurHash32 from hash_functions.cuh (#13681) @davidwendt
  • Improve performance of cudf::strings::split on whitespace (#13680) @davidwendt
  • Allow ORC and Parquet writers to write nullable columns without nulls as non-nullable (#13675) @vuule
  • Raise a NotImplementedError in to_datetime when utc is passed (#13670) @shwina
  • Add rmm_mode parameter to nvbench base fixture (#13668) @davidwendt
  • Fix multiindex loc ordering in pandas-compat mode (#13660) @wence-
  • Add nvtext hash_character_ngrams function (#13654) @davidwendt
  • Avoid storing metadata in pointers in ORC and Parquet writers (#13648) @vuule
  • Acquire spill lock in to/from_arrow (#13646) @shwina
  • Expose stable versions of libcudf sort routines (#13634) @wence-
  • Separate out hash_test.cpp source for each hash API (#13633) @davidwendt
  • Remove deprecated cudf::strings::slice_strings (by delimiter) functions (#13628) @davidwendt
  • Create separate libcudf hash APIs for each supported hash function (#13626) @davidwendt
  • Add convert_dtypes API (#13623) @shwina
  • Clean up cupy in dependencies.yaml. (#13617) @bdice
  • Use cuda-version to constrain cudatoolkit. (#13615) @bdice
  • Add murmurhash3_x64_128 function to libcudf (#13604) @davidwendt
  • Performance improvement for cudf::strings::like (#13594) @davidwendt
  • Remove deprecated cudf.set_allocator. (#13591) @bdice
  • Clean up cudf device atomic with cuda::atomic_ref (#13583) @PointKernel
  • Add java bindings for distinct count (#13573) @revans2
  • Use nvcomp conda package. (#13566) @bdice
  • Add exception to string_scalar if input string exceeds size_type (#13560) @davidwendt
  • Add dispatch for cudf.Dataframe to/from pyarrow.Table conversion (#13558) @rjzamora
  • Get rid of cuco::pair_type aliases (#13553) @PointKernel
  • Introduce parity with pandas when sort=False in Groupby (#13551) @galipremsagar
  • Update CMake in docker to 3.26.4 (#13550) @NvTimLiu
  • Clarify source of error message in stream testing. (#13541) @bdice
  • Deprecate strings_to_categorical in cudf.read_parquet (#13540) @galipremsagar
  • Update to CMake 3.26.4 (#13538) @vyasr
  • s3 folder naming fix (#13536) @AyodeAwe
  • Implement iloc-getitem using parse-don't-validate approach (#13534) @wence-
  • Make synchronization explicit in the names of hostdevice_* copying APIs (#13530) @ttnghia
  • Add benchmark (Google Benchmark) dependency to conda packages. (#13528) @bdice
  • Add libcufile to dependencies.yaml. (#13523) @bdice
  • Fix some memoization logic in groupby/sort/sort_helper.cu (#13521) @davidwendt
  • Use sizes_to_offsets_iterator in cudf::gather for strings (#13520) @davidwendt
  • use rapids-upload-docs script (#13518) @AyodeAwe
  • Support UTF-8 BOM in CSV reader (#13516) @davidwendt
  • Move stream-related test configuration to CMake (#13513) @vyasr
  • Implement cudf.option_context (#13511) @galipremsagar
  • Unpin dask and distributed for development (#13508) @galipremsagar
  • Change build.sh to use pip install instead of setup.py (#13507) @vyasr
  • Use test default stream (#13506) @vyasr
  • Remove documentation build scripts for Jenkins (#13495) @ajschmidt8
  • Use east const in include files (#13494) @karthikeyann
  • Use east const in src files (#13493) @karthikeyann
  • Use east const in tests files (#13492) @karthikeyann
  • Use east const in benchmarks files (#13491) @karthikeyann
  • Performance improvement for nvtext tokenize/token functions (#13480) @davidwendt
  • Add pd.Float*Dtype to Avro and ORC mappings (#13475) @mroeschke
  • Use pandas public APIs where available (#13467) @mroeschke
  • Allow pd.ArrowDtype in cudf.from_pandas (#13465) @mroeschke
  • Rework libcudf regex benchmarks with nvbench (#13464) @davidwendt
  • Remove unused max_rows_tensor parameter from subword tokenizer (#13463) @davidwendt
  • Separate io-text and nvtext pytests into different files (#13435) @davidwendt
  • Add a move_to function to cudf::string_view::const_iterator (#13428) @davidwendt
  • Allow newer scikit-build (#13424) @vyasr
  • Refactor sort_by_values to sort_values, drop indices from return values. (#13419) @bdice
  • Inline Cython exception handler (#13411) @vyasr
  • Init JNI version 23.08.0-SNAPSHOT (#13401) @pxLi
  • Refactor ORC reader (#13396) @ttnghia
  • JNI: Remove cleaned objects in memory cleaner (#13378) @res-life
  • Add tests of currently unsupported indexing (#13338) @wence-
  • Performance improvement for some libcudf regex functions for long strings (#13322) @davidwendt
  • Exposure Tracked Buffer (first step towards unifying copy-on-write and spilling) (#13307) @madsbk
  • Write string data directly to column_buffer in Parquet reader (#13302) @etseidl
  • Add stacktrace into cudf exception types (#13298) @ttnghia
  • cuDF: Build CUDA 12 packages (#12922) @bdice

- C++
Published by raydouglass over 2 years ago

https://github.com/rapidsai/cudf - [NIGHTLY] v23.10.00

🔗 Links

🚨 Breaking Changes

  • Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
  • Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
  • Create table_input_metadata from a table_metadata (#13920) @etseidl
  • Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
  • Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
  • Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
  • Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
  • Remove the libcudf cudf::offset_type type (#13788) @davidwendt
  • Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
  • Update to Cython 3.0.0 (#13777) @vyasr
  • Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
  • Enforce deprecations in 23.10 (#13732) @galipremsagar
  • Upgrade to arrow 12 (#13728) @galipremsagar
  • Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule

🐛 Bug Fixes

  • Fix kernel launch error for cudf::io::orc::gpu::rowgroup_char_counts_kernel (#14139) @davidwendt
  • Don't sort columns for DataFrame init from list of Series (#14136) @mroeschke
  • Fix DataFrame.values with no columns but index (#14134) @mroeschke
  • Avoid circular cimports in _lib/cpp/reduce.pxd (#14125) @vyasr
  • Add support for nested dict in DataFrame constructor (#14119) @galipremsagar
  • Restrict iterables of DataFrame's as input to DataFrame constructor (#14118) @galipremsagar
  • Allow numeric_only=True for reduction operations on numeric types (#14111) @galipremsagar
  • Drop kwargs from Series.count (#14106) @galipremsagar
  • Fix naming issues with Index.to_frame and MultiIndex.to_frame APIs (#14105) @galipremsagar
  • Only use memory resources that haven't been freed (#14103) @robertmaynard
  • Add support for __round__ in Series and DataFrame (#14099) @galipremsagar
  • Validate ignore_index type in drop_duplicates (#14098) @mroeschke
  • Fix renaming Series and Index (#14080) @galipremsagar
  • Raise NotImplementedError in to_datetime if Z (or tz component) in string (#14074) @mroeschke
  • Raise NotImplementedError for datetime strings with UTC offset (#14070) @mroeschke
  • Update pyarrow-related dispatch logic in dask_cudf (#14069) @rjzamora
  • Use conda mambabuild rather than mamba mambabuild (#14067) @wence-
  • Raise NotImplementedError in to_datetime with dayfirst without infer_format (#14058) @mroeschke
  • Fix various issues in Index.intersection (#14054) @galipremsagar
  • Fix Index.difference to match with pandas (#14053) @galipremsagar
  • Fix empty string column construction (#14052) @galipremsagar
  • Fix IntervalIndex.union to preserve type-metadata (#14051) @galipremsagar
  • Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
  • Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
  • Ignore compile_commands.json (#14048) @harrism
  • Raise TypeError for any non-parseable argument in to_datetime (#14044) @mroeschke
  • Raise NotImplementedError for to_datetime with z format (#14037) @mroeschke
  • Implement sort_remaining for sort_index (#14033) @wence-
  • Raise NotImplementedError for Categoricals with timezones (#14032) @mroeschke
  • Temporary fix Parquet metadata with empty value string being ignored from writing (#14026) @ttnghia
  • Preserve types of scalar being returned when possible in quantile (#14014) @galipremsagar
  • Fix return type of MultiIndex.difference (#14009) @galipremsagar
  • Raise an error when timezone subtypes are encountered in pd.IntervalDtype (#14006) @galipremsagar
  • Fix map column can not be non-nullable for java (#14003) @res-life
  • Fix name selection in Index.difference and Index.intersection (#13986) @galipremsagar
  • Restore column type metadata with dropna to fix factorize API (#13980) @galipremsagar
  • Use thread_index_type to avoid out of bounds accesses in conditional joins (#13971) @vyasr
  • Fix MultiIndex.to_numpy to return numpy array with tuples (#13966) @galipremsagar
  • Use cudf::thread_index_type in get_json_object and tdigest kernels (#13962) @nvdbaranec
  • Fix an issue with IntervalIndex.repr when null values are present (#13958) @galipremsagar
  • Fix type metadata issue preservation with Column.unique (#13957) @galipremsagar
  • Handle Interval scalars when passed in list-like inputs to cudf.Index (#13956) @galipremsagar
  • Fix setting of categories order when dtype is passed to a CategoricalColumn (#13955) @galipremsagar
  • Handle as_index in GroupBy.apply (#13951) @brandon-b-miller
  • Raise error for string types in nsmallest and nlargest (#13946) @galipremsagar
  • Fix index of Groupby.apply results when it is performed on empty objects (#13944) @galipremsagar
  • Fix integer overflow in shim device_sum functions (#13943) @brandon-b-miller
  • Fix type mismatch in groupby reduction for empty objects (#13942) @galipremsagar
  • Fixed processed bytes calculation in APPLY_BOOLEAN_MASK benchmark. (#13937) @Blonck
  • Fix construction of Grouping objects (#13932) @galipremsagar
  • Fix an issue with loc when column names is MultiIndex (#13929) @galipremsagar
  • Fix handling of typecasting in searchsorted (#13925) @galipremsagar
  • Preserve index name in reindex (#13917) @galipremsagar
  • Use cudf::thread_index_type in cuIO to prevent overflow in row indexing (#13910) @vuule
  • Fix for encodings listed in the Parquet column chunk metadata (#13907) @etseidl
  • Use cudf::thread_index_type in concatenate.cu. (#13906) @bdice
  • Use cudf::thread_index_type in replace.cu. (#13905) @bdice
  • Add noSanitizer tag to Java reduction tests failing with sanitizer in CUDA 12 (#13904) @jlowe
  • Remove the internal use of the cudf's default stream in cuIO (#13903) @vuule
  • Use cuda-nvtx-dev CUDA 12 package. (#13901) @bdice
  • Use thread_index_type to avoid index overflow in grid-stride loops (#13895) @PointKernel
  • Fix memory access error in cudf::shift for sliced strings (#13894) @davidwendt
  • Raise error when trying to construct a DataFrame with mixed types (#13889) @galipremsagar
  • Return nan when one variable to be correlated has zero variance in JIT GroupBy Apply (#13884) @brandon-b-miller
  • Correctly detect the BOM mark in read_csv with compressed input (#13881) @vuule
  • Check for the presence of all values in MultiIndex.isin (#13879) @galipremsagar
  • Fix nvtext::generate_character_ngrams performance regression for longer strings (#13874) @davidwendt
  • Fix return type of MultiIndex.levels (#13870) @galipremsagar
  • Fix List's missing children metadata in JSON writer (#13869) @karthikeyann
  • Disable construction of Index when freq is set in pandas-compatibility mode (#13857) @galipremsagar
  • Fix an issue with fetching NA from a TimedeltaColumn (#13853) @galipremsagar
  • Simplify implementation of interval_range() and fix behaviour for floating freq (#13844) @shwina
  • Fix binary operations between Series and Index (#13842) @galipremsagar
  • Update make_lists_column_from_scalar to use make_offsets_child_column utility (#13841) @davidwendt
  • Fix read out of bounds in string concatenate (#13838) @pentschev
  • Raise error for more cases when timezone-aware data is passed to as_column (#13835) @galipremsagar
  • Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
  • Raise error when trying to construct time-zone aware timestamps (#13830) @galipremsagar
  • Fix cuFile I/O factories (#13829) @vuule
  • DataFrame with namedtuples uses ._field as column names (#13824) @mroeschke
  • Branch 23.10 merge 23.08 (#13822) @vyasr
  • Return a Series from JIT GroupBy apply, rather than a DataFrame (#13820) @brandon-b-miller
  • No need to dlsym EnsureS3Finalized we can call it directly (#13819) @robertmaynard
  • Raise error when mixed types are being constructed (#13816) @galipremsagar
  • Fix unbounded sequence issue in DataFrame constructor (#13811) @galipremsagar
  • Fix Byte-Pair-Encoding usage of cuco static-map for storing merge-pairs (#13807) @davidwendt
  • Fix for Parquet writer when requested pages per row is smaller than fragment size (#13806) @etseidl
  • Remove hangs from trying to construct un-bounded sequences (#13799) @galipremsagar
  • Bug/update libcudf to handle arrow12 changes (#13794) @robertmaynard
  • Update get_arrow to arrows 12 CMake target name of arrow::xsimd (#13790) @robertmaynard
  • Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
  • Fix negative unary operation for boolean type (#13780) @galipremsagar
  • Fix contains(in) method for Series (#13779) @galipremsagar
  • Fix binary operation column ordering and missing column issues (#13778) @galipremsagar
  • Cast only time of day to nanos to avoid an overflow in Parquet INT96 write (#13776) @gerashegalov
  • Preserve names of column object in various APIs (#13772) @galipremsagar
  • Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
  • Fix construction of DataFrames from dict when columns are provided (#13766) @wence-
  • Provide our own Cython declaration for make_unique (#13746) @wence-

📖 Documentation

  • Fix typo in docstring: metadata. (#14025) @bdice
  • Fix typo in parquet/page_decode.cuh (#13849) @XinyuZeng
  • Simplify Python doc configuration (#13826) @vyasr
  • Update documentation to reflect recent changes in JSON reader and writer (#13791) @vuule
  • Fix all warnings in Python docs (#13789) @vyasr

🚀 New Features

  • Implement GroupBy.value_counts to match pandas API (#14114) @stmio
  • Refactor parquet thrift reader (#14097) @etseidl
  • Refactor hash_reduce_by_row (#14095) @ttnghia
  • Support negative preceding/following for ROW window functions (#14093) @mythrocks
  • Expose streams in public search APIs (#14034) @vyasr
  • Expose streams in public replace APIs (#14010) @vyasr
  • Add stream parameter to public cudf::strings::split APIs (#13997) @davidwendt
  • Expose streams in public filling APIs (#13990) @vyasr
  • Expose streams in public concatenate APIs (#13987) @vyasr
  • Use HostMemoryAllocator in jni::allocate_host_buffer (#13975) @gerashegalov
  • Enable fractional null probability for hashing benchmark (#13967) @Blonck
  • Switch pylibcudf-enabled types to use enum class in Cython (#13931) @vyasr
  • Rewrite DataFrame.stack to support multi level column names (#13927) @isVoid
  • Add HostMemoryAllocator interface (#13924) @gerashegalov
  • Global stream pool (#13922) @etseidl
  • Create table_input_metadata from a table_metadata (#13920) @etseidl
  • Translate column size overflow exception to JNI (#13911) @mythrocks
  • Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
  • Exclude some tests from running with the compute sanitizer (#13872) @firestarman
  • Expand statistics support in ORC writer (#13848) @vuule
  • Register the memory mapped buffer in datasource to improve H2D throughput (#13814) @vuule
  • Add cudf::strings::find function with target per row (#13808) @davidwendt
  • Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
  • Remove unnecessary pointer copying in JIT GroupBy Apply (#13792) @brandon-b-miller
  • Add 'poll' function to custreamz kafka consumer (#13782) @jdye64
  • Support corr in GroupBy.apply through the jit engine (#13767) @shwina
  • Optionally write version 2 page headers in Parquet writer (#13751) @etseidl
  • Support more numeric types in Groupby.apply with engine='jit' (#13729) @brandon-b-miller
  • [FEA] Add DELTA_BINARY_PACKED decoding support to Parquet reader (#13637) @etseidl
  • Read FIXED_LEN_BYTE_ARRAY as binary in parquet reader (#13437) @PointKernel

🛠️ Improvements

  • Reduce memory usage of as_categorical_column (#14138) @wence-
  • Update to clang 16.0.6. (#14120) @bdice
  • Fix type of empty Index and raise warning in Series constructor (#14116) @galipremsagar
  • Add fallback matrix for nvcomp. (#14082) @bdice
  • [Java] Add recoverWithNull to JSONOptions and pass to Table.readJSON (#14078) @andygrove
  • Remove header tests (#14072) @ajschmidt8
  • Remove debug print in a Parquet test (#14063) @vuule
  • Expose stream parameter in public strings find APIs (#14060) @davidwendt
  • Update doxygen to 1.9.1 (#14059) @vyasr
  • Remove the mr from the base fixture (#14057) @vyasr
  • Expose streams in public strings case APIs (#14056) @davidwendt
  • Use cudf::make_empty_column instead of column_view constructor (#14030) @davidwendt
  • Remove quadratic runtime due to accessing Frame._dtypes in loop (#14028) @wence-
  • Explicitly depend on zlib in conda recipes (#14018) @wence-
  • Use grid_stride for stride computations. (#13996) @bdice
  • Fix an issue where casting null-array to object dtype will result in a failure (#13994) @galipremsagar
  • Add tab as literal to cudf::test::to_string output (#13993) @davidwendt
  • Enable codes dtype parity in pandas-compatibility mode for factorize API (#13982) @galipremsagar
  • Fix CategoricalIndex ordering in Groupby.agg when pandas-compatibility mode is enabled (#13978) @galipremsagar
  • Produce a fatal error if cudf is unable to find pyarrow include directory (#13976) @cwharris
  • Use thread_index_type in partitioning.cu (#13973) @divyegala
  • Use cudf::thread_index_type in merge.cu (#13972) @divyegala
  • Use copy-pr-bot (#13970) @ajschmidt8
  • Use cudf::thread_index_type in strings custom kernels (#13968) @davidwendt
  • Add bytes_per_second to hash_partition benchmark (#13965) @Blonck
  • Added pinned pool reservation API for java (#13964) @revans2
  • Simplify wheel build scripts and allow alphas of RAPIDS dependencies (#13963) @vyasr
  • Add bytes_per_second to copy_if_else benchmark (#13960) @Blonck
  • Add pandas compatible output to Series.unique (#13959) @galipremsagar
  • Add bytes_per_second to compiled binaryop benchmark (#13938) @Blonck
  • Unpin dask and distributed for 23.10 development (#13935) @galipremsagar
  • Make HostColumnVector.getRefCount public (#13934) @abellina
  • Use cuco::static_set in JSON tree algorithm (#13928) @karthikeyann
  • Add java API to get size of host memory needed to copy column view (#13919) @revans2
  • Use cudf::size_type instead of int32 where appropriate in nvtext functions (#13915) @davidwendt
  • Enable hugepage for arrow host allocations (#13914) @madsbk
  • Improve performance of nvtext::edit_distance (#13912) @davidwendt
  • Ensure cudf internals use pylibcudf in pure Python mode (#13909) @vyasr
  • Use empty() instead of size() where possible (#13908) @vuule
  • [JNI] Adds HostColumnVector.EventHandler for spillability checks (#13898) @abellina
  • Return Timestamp & Timedelta for fetching scalars in DatetimeIndex & TimedeltaIndex (#13896) @galipremsagar
  • Disable creation of DatetimeIndex when freq is passed to cudf.date_range (#13890) @galipremsagar
  • Bring parity with pandas for datetime & timedelta comparison operations (#13877) @galipremsagar
  • Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
  • Raise error when astype(object) is called in pandas compatibility mode (#13862) @galipremsagar
  • Fixes a performance regression in FST (#13850) @elstehle
  • Set native handles to null on close in Java wrapper classes (#13818) @jlowe
  • Avoid use of CUDF_EXPECTS in libcudf unit tests outside of helper functions with return values (#13812) @vuule
  • Update lists::contains to experimental row comparator (#13810) @divyegala
  • Reduce lists::contains dispatches for scalars (#13805) @divyegala
  • Long string optimization for string column parsing in JSON reader (#13803) @karthikeyann
  • Raise NotImplementedError for pd.SparseDtype (#13798) @mroeschke
  • Remove the libcudf cudf::offset_type type (#13788) @davidwendt
  • Move Spark-independent Table debug to cudf Java (#13783) @gerashegalov
  • Update to Cython 3.0.0 (#13777) @vyasr
  • Refactor Parquet reader handling of V2 page header info (#13775) @etseidl
  • Branch 23.10 merge 23.08 (#13773) @vyasr
  • Restructure JSON code to correctly reflect legacy/experimental status (#13757) @vuule
  • Branch 23.10 merge 23.08 (#13753) @vyasr
  • Enforce deprecations in 23.10 (#13732) @galipremsagar
  • Upgrade to arrow 12 (#13728) @galipremsagar
  • Refactors JSON reader's pushdown automaton (#13716) @elstehle
  • Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule

- C++
Published by rapids-bot[bot] over 2 years ago

https://github.com/rapidsai/cudf - [NIGHTLY] v23.06.00

🔗 Links

🚨 Breaking Changes

  • Fix batch processing for parquet writer (#13438) @ttnghia
  • Use <NA> instead of null to match pandas. (#13415) @bdice
  • Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
  • Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
  • Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
  • Remove null mask and null count from column_view constructors (#13311) @vyasr
  • Change default value of the observed= argument in groupby to True to reflect the actual behaviour (#13296) @shwina
  • Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
  • Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
  • Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
  • Update minimum Python version to Python 3.9 (#13196) @shwina
  • Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
  • Cleanup Parquet chunked writer (#13094) @ttnghia
  • Cleanup ORC chunked writer (#13091) @ttnghia
  • Raise NotImplementedError when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
  • Remove deprecated regex functions from libcudf (#13067) @davidwendt
  • [REVIEW] Upgrade to arrow-11 (#12757) @galipremsagar
  • Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller

🐛 Bug Fixes

  • Fix valid count computation in offset_bitmask_binop kernel (#13489) @davidwendt
  • Fix writing of ORC files with empty rowgroups (#13466) @vuule
  • Fix cudf::repeat logic when count is zero (#13459) @davidwendt
  • Fix batch processing for parquet writer (#13438) @ttnghia
  • Fix invalid use of std::exclusive_scan in Parquet writer (#13434) @etseidl
  • Patch numba if it is imported first to ensure minor version compatibility works. (#13433) @bdice
  • Fix cudf::strings::replace_with_backrefs hang on empty match result (#13418) @davidwendt
  • Use <NA> instead of null to match pandas. (#13415) @bdice
  • Fix tokenize with non-space delimiter (#13403) @shwina
  • Fix groupby head/tail for empty dataframe (#13398) @shwina
  • Default to closed="right" in IntervalIndex constructor (#13394) @shwina
  • Correctly reorder and reindex scan groupbys with null keys (#13389) @wence-
  • Fix unused argument errors in nvcc 11.5 (#13387) @abellina
  • Updates needed to work with jitify that leverages libcudacxx (#13383) @robertmaynard
  • Fix unused parameter warning/error in parquet/page_data.cu (#13367) @davidwendt
  • Fix page size estimation in Parquet writer (#13364) @etseidl
  • Fix subword_tokenize error when input contains no tokens (#13320) @davidwendt
  • Support gcc 12 as the C++ compiler (#13316) @robertmaynard
  • Correctly set bitmask size in from_column_view (#13315) @wence-
  • Fix approach to detecting assignment for gte/lte operators (#13285) @vyasr
  • Fix parquet schema interpretation issue (#13277) @hyperbolic2346
  • Fix 64bit shift bug in avro reader (#13276) @karthikeyann
  • Fix unused variables/parameters in parquet/writer_impl.cu (#13263) @davidwendt
  • Clean up buffers in case AssertionError (#13262) @razajafri
  • Allow empty input table in ast compute_column (#13245) @wence-
  • Fix structs_column_wrapper constructors to copy input column wrappers (#13243) @davidwendt
  • Fix the row index stream order in ORC reader (#13242) @vuule
  • Make is_decompression_disabled and is_compression_disabled thread-safe (#13240) @vuule
  • Add [[maybe_unused]] to nvbench environment. (#13219) @bdice
  • Fix race in ORC string dictionary creation (#13214) @revans2
  • Add scalar argtypes to udf cache keys (#13194) @brandon-b-miller
  • Fix unused parameter warning/error in grouped_rolling.cu (#13192) @davidwendt
  • Avoid skbuild 0.17.2 which affected the cmake -DPython_LIBRARY string (#13188) @sevagh
  • Fix hostdevice_vector::subspan (#13187) @ttnghia
  • Use custom nvbench entry point to ensure cudf::nvbench_base_fixture usage (#13183) @robertmaynard
  • Fix slice_strings to return empty strings for stop < start indices (#13178) @davidwendt
  • Allow compilation with any GTest version 1.11+ (#13153) @robertmaynard
  • Fix a few clang-format style check errors (#13146) @davidwendt
  • [REVIEW] Fix Series and DataFrame constructors to validate index lengths (#13122) @galipremsagar
  • Fix hash join when the input tables have nulls on only one side (#13120) @ttnghia
  • Fix GPU_ARCHS setting in Java CMake build and CMAKE_CUDA_ARCHITECTURES in Python package build. (#13117) @davidwendt
  • Adds checks to make sure json reader won't overflow (#13115) @elstehle
  • Fix null_count of columns returned by chunked_parquet_reader (#13111) @vuule
  • Fixes sliced list and struct column bug in JSON chunked writer (#13108) @karthikeyann
  • [REVIEW] Fix missing confluent kafka version (#13101) @galipremsagar
  • Use make_empty_lists_column instead of make_empty_column(type_id::LIST) (#13099) @davidwendt
  • Raise NotImplementedError when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
  • Fix column selection read_parquet benchmarks (#13082) @vuule
  • Fix bugs in iterative groupby apply algorithm (#13078) @brandon-b-miller
  • Add algorithm include in data_sink.hpp (#13068) @ahendriksen
  • Fix tests/identify_stream_usage.cpp (#13066) @ahendriksen
  • Prevent overflow with skip_rows in ORC and Parquet readers (#13063) @vuule
  • Add except declaration in Cython interface for regex_program::create (#13054) @davidwendt
  • [REVIEW] Fix branch version in CI scripts (#13029) @galipremsagar
  • Fix OOB memory access in CSV reader when reading without NA values (#13011) @vuule
  • Fix read_avro() skip_rows and num_rows. (#12912) @tpn
  • Purge nonempty nulls from byte_cast list outputs. (#11971) @bdice
  • Fix consumption of CPU-backed interchange protocol dataframes (#11392) @shwina

🚀 New Features

  • Remove numba JIT kernel usage from dataframe copy tests (#13385) @brandon-b-miller
  • Add JNI for ORC/Parquet writer compression statistics (#13376) @ttnghia
  • Use compile_or_get in JIT groupby apply (#13350) @brandon-b-miller
  • cuDF numba cuda 12 updates (#13337) @brandon-b-miller
  • Add tz_convert method to convert between timestamps (#13328) @shwina
  • Optionally return compression statistics from ORC and Parquet writers (#13294) @vuule
  • Support the case=False argument to str.contains (#13290) @shwina
  • Add an event handler for ColumnVector.close (#13279) @abellina
  • JNI api for cudf::chunked_pack (#13278) @abellina
  • Implement a chunked_pack API (#13260) @abellina
  • Update cudf recipes to use GTest version to >=1.13 (#13207) @robertmaynard
  • JNI changes for range-extents in window functions. (#13199) @mythrocks
  • Add support for DatetimeTZDtype and tz_localize (#13163) @shwina
  • Add IS_NULL operator to AST (#13145) @karthikeyann
  • STRING order-by column for RANGE window functions (#13143) @mythrocks
  • Update contains_table to experimental row hasher and equality comparator (#13119) @divyegala
  • Automatically select GroupBy.apply algorithm based on if the UDF is jittable (#13113) @brandon-b-miller
  • Refactor Parquet chunked writer (#13076) @ttnghia
  • Add Python bindings for string literal support in AST (#13073) @karthikeyann
  • Add Java bindings for string literal support in AST (#13072) @karthikeyann
  • Add string scalar support in AST (#13061) @karthikeyann
  • Log cuIO warnings using the libcudf logger (#13043) @vuule
  • Update mixed_join to use experimental row hasher and comparator (#13028) @divyegala
  • Support structs of lists in row lexicographic comparator (#13005) @ttnghia
  • Adding hostdevice_span that is a span createable from hostdevice_vector (#12981) @hyperbolic2346
  • Add nvtext::minhash function (#12961) @davidwendt
  • Support lists of structs in row lexicographic comparator (#12953) @ttnghia
  • Update join to use experimental row hasher and comparator (#12787) @divyegala
  • Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller

🛠️ Improvements

  • Bump typing_extensions minimum version to 4.0.0 (#13618) @shwina
  • Drop extraneous dependencies from cudf conda recipe. (#13406) @bdice
  • Handle some corner-cases in indexing with boolean masks (#13402) @wence-
  • Add cudf::stable_distinct public API, tests, and benchmarks. (#13392) @bdice
  • [JNI] Pass this ColumnVector to the onClosed event handler (#13386) @abellina
  • Fix JNI method with mismatched parameter list (#13384) @ttnghia
  • Split up experimental_row_operator_tests.cu to improve its compile time (#13382) @davidwendt
  • Deprecate cudf::strings::slice_strings APIs that accept delimiters (#13373) @davidwendt
  • Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
  • Move some nvtext benchmarks to nvbench (#13368) @davidwendt
  • run docs nightly too (#13366) @AyodeAwe
  • Add warning for default dtype parameter in get_dummies (#13365) @galipremsagar
  • Add log messages about kvikIO compatibility mode (#13363) @vuule
  • Switch back to using primary shared-action-workflows branch (#13362) @vyasr
  • Deprecate StringIndex and use Index instead (#13361) @galipremsagar
  • Ensure columns have valid null counts in CUDF JNI. (#13355) @mythrocks
  • Expunge most uses of TypeVar(bound="Foo") (#13346) @wence-
  • Remove all references to UNKNOWN_NULL_COUNT in Python (#13345) @vyasr
  • Improve distinct_count with cuco::static_set (#13343) @PointKernel
  • Fix contiguous_split performance (#13342) @ttnghia
  • Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
  • Update mypy to 1.3 (#13340) @wence-
  • [Java] Purge non-empty nulls when setting validity (#13335) @razajafri
  • Add row-wise filtering step to read_parquet (#13334) @rjzamora
  • Performance improvement for nvtext::minhash (#13333) @davidwendt
  • Fix some libcudf functions to set the null count on returning columns (#13331) @davidwendt
  • Change cudf::detail::concatenate_masks to return null-count (#13330) @davidwendt
  • Move meta calculation in dask_cudf.read_parquet (#13327) @rjzamora
  • Changes to support Numpy >= 1.24 (#13325) @shwina
  • Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
  • Clean up distinct_count benchmark (#13321) @PointKernel
  • Fix gtest pinning to 1.13.0. (#13319) @bdice
  • Remove null mask and null count from column_view constructors (#13311) @vyasr
  • Address feedback from 13289 (#13306) @vyasr
  • Change default value of the observed= argument in groupby to True to reflect the actual behaviour (#13296) @shwina
  • First check for BaseDtype when inferring the data type of an arbitrary object (#13295) @shwina
  • Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
  • Support CUDA 12.0 for pip wheels (#13289) @divyegala
  • Refactor transform_lists_of_structs in row_operators.cu (#13288) @ttnghia
  • Branch 23.06 merge 23.04 (#13286) @vyasr
  • Update cupy dependency (#13284) @vyasr
  • Performance improvement in cudf::strings::join_strings for long strings (#13283) @davidwendt
  • Fix unused variables and functions (#13275) @karthikeyann
  • Fix integer overflow in partition scatter_map construction (#13272) @wence-
  • Numba 0.57 compatibility fixes (#13271) @gmarkall
  • Performance improvement in cudf::strings::all_characters_of_type (#13259) @davidwendt
  • Remove default null-count parameter from some libcudf factory functions (#13258) @davidwendt
  • Roll our own generate_string() because mimesis' has gone away (#13257) @shwina
  • Build wheels using new single image workflow (#13249) @vyasr
  • Enable sccache hits from local builds (#13248) @AyodeAwe
  • Revert to branch-23.06 for shared-action-workflows (#13247) @shwina
  • Introduce pandas_compatible option in cudf (#13241) @galipremsagar
  • Add metadata_builder helper class (#13232) @abellina
  • Use libkvikio conda packages in libcudf, add explicit libcufile dependency. (#13231) @bdice
  • Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
  • Performance improvement in cudf::strings::find/rfind for long strings (#13226) @davidwendt
  • Add chunked reader benchmark (#13223) @SrikarVanavasam
  • Set the null count in output columns in the CSV reader (#13221) @vuule
  • Skip Non-Empty nulls tests for the nightly build just like we skip CuFileTest and CudaFatalTest (#13213) @razajafri
  • Fix string_scalar stream usage in write_json.cu (#13212) @davidwendt
  • Use canonicalized name for dlopen'd libraries (libcufile) (#13210) @shwina
  • Refactor pinned memory vector and ORC+Parquet writers (#13206) @ttnghia
  • Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
  • Optimization to decoding of parquet level streams (#13203) @nvdbaranec
  • Clean up and simplify gpuDecideCompression (#13202) @vuule
  • Use std::array for a statically sized vector in create_serialized_trie (#13201) @vuule
  • Update minimum Python version to Python 3.9 (#13196) @shwina
  • Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
  • Remove usage of rapids-get-rapids-version-from-git (#13184) @jjacobelli
  • Enable mixed-dtype decimal/scalar binary operations (#13171) @shwina
  • Split up unique_count.cu to improve build time (#13169) @davidwendt
  • Use nvtx3 includes in string examples. (#13165) @bdice
  • Change some .cu gtest files to .cpp (#13155) @davidwendt
  • Remove wheel pytest verbosity (#13151) @sevagh
  • Fix libcudf to always pass null-count to set_null_mask (#13149) @davidwendt
  • Fix gtests to always pass null-count to set_null_mask calls (#13148) @davidwendt
  • Optimize JSON writer (#13144) @karthikeyann
  • Performance improvement for libcudf upper/lower conversion for long strings (#13142) @davidwendt
  • [REVIEW] Deprecate pad and backfill methods (#13140) @galipremsagar
  • Use CTAD instead of functions in ProtobufReader (#13135) @vuule
  • Remove more instances of UNKNOWN_NULL_COUNT (#13134) @vyasr
  • Update clang-format to 16.0.1. (#13133) @bdice
  • Add log messages about cuIO's nvCOMP and cuFile use (#13132) @vuule
  • Branch 23.06 merge 23.04 (#13131) @vyasr
  • Compute null-count in cudf::detail::slice (#13124) @davidwendt
  • Use ARC V2 self-hosted runners for GPU jobs (#13123) @jjacobelli
  • Set null-count in linked_column_view conversion operator (#13121) @davidwendt
  • Adding ifdefs around nvcc-specific pragmas (#13110) @hyperbolic2346
  • Add null-count parameter to json experimental parse_data utility (#13107) @davidwendt
  • Remove uses-setup-env-vars (#13105) @vyasr
  • Explicitly compute null count in concatenate APIs (#13104) @vyasr
  • Replace unnecessary uses of UNKNOWN_NULL_COUNT (#13102) @vyasr
  • Performance improvement for cudf::string_view::find functions (#13100) @davidwendt
  • Use .element() instead of .data() for window range calculations (#13095) @mythrocks
  • Cleanup Parquet chunked writer (#13094) @ttnghia
  • Fix unused variable error/warning in page_data.cu (#13093) @davidwendt
  • Cleanup ORC chunked writer (#13091) @ttnghia
  • Remove using namespace cudf; from libcudf gtests source (#13089) @davidwendt
  • Change cudf::test::make_null_mask to also return null-count (#13081) @davidwendt
  • Resolved automerger from branch-23.04 to branch-23.06 (#13080) @galipremsagar
  • Assert for non-empty nulls (#13071) @razajafri
  • Remove deprecated regex functions from libcudf (#13067) @davidwendt
  • Refactor cudf::detail::sorted_order (#13062) @ttnghia
  • Improve performance of slice_strings for long strings (#13057) @davidwendt
  • Reduce shared memory usage in gpuComputePageSizes by 50% (#13047) @nvdbaranec
  • [REVIEW] Add notes to performance comparisons notebook (#13044) @galipremsagar
  • Enable binary operations between scalars and columns of differing decimal types (#13034) @shwina
  • Remove console output from some libcudf gtests (#13027) @davidwendt
  • Remove underscore in build string. (#13025) @bdice
  • Bump up JNI version 23.06.0-SNAPSHOT (#13021) @pxLi
  • Fix auto merger from branch-23.04 to branch-23.06 (#13009) @galipremsagar
  • Reduce peak memory use when writing compressed ORC files. (#12963) @vuule
  • Add nvtx annotations to groupby methods (#12941) @wence-
  • Compute column sizes in Parquet preprocess with single kernel (#12931) @SrikarVanavasam
  • Add Python bindings for time zone data (TZiF) reader (#12826) @shwina
  • Optimize set-like operations (#12769) @ttnghia
  • [REVIEW] Upgrade to arrow-11 (#12757) @galipremsagar
  • Add empty test files for test reorganization (#12288) @shwina

- C++
Published by rapids-bot[bot] over 2 years ago

https://github.com/rapidsai/cudf - v23.06.01

🚨 Breaking Changes

  • Fix batch processing for parquet writer (#13438) @ttnghia
  • Use <NA> instead of null to match pandas. (#13415) @bdice
  • Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
  • Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
  • Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
  • Remove null mask and null count from column_view constructors (#13311) @vyasr
  • Change default value of the observed= argument in groupby to True to reflect the actual behaviour (#13296) @shwina
  • Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
  • Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
  • Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
  • Update minimum Python version to Python 3.9 (#13196) @shwina
  • Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
  • Cleanup Parquet chunked writer (#13094) @ttnghia
  • Cleanup ORC chunked writer (#13091) @ttnghia
  • Raise NotImplementedError when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
  • Remove deprecated regex functions from libcudf (#13067) @davidwendt
  • [REVIEW] Upgrade to arrow-11 (#12757) @galipremsagar
  • Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller

🐛 Bug Fixes

  • Fix valid count computation in offset_bitmask_binop kernel (#13489) @davidwendt
  • Fix writing of ORC files with empty rowgroups (#13466) @vuule
  • Fix cudf::repeat logic when count is zero (#13459) @davidwendt
  • Fix batch processing for parquet writer (#13438) @ttnghia
  • Fix invalid use of std::exclusive_scan in Parquet writer (#13434) @etseidl
  • Patch numba if it is imported first to ensure minor version compatibility works. (#13433) @bdice
  • Fix cudf::strings::replace_with_backrefs hang on empty match result (#13418) @davidwendt
  • Use <NA> instead of null to match pandas. (#13415) @bdice
  • Fix tokenize with non-space delimiter (#13403) @shwina
  • Fix groupby head/tail for empty dataframe (#13398) @shwina
  • Default to closed="right" in IntervalIndex constructor (#13394) @shwina
  • Correctly reorder and reindex scan groupbys with null keys (#13389) @wence-
  • Fix unused argument errors in nvcc 11.5 (#13387) @abellina
  • Updates needed to work with jitify that leverages libcudacxx (#13383) @robertmaynard
  • Fix unused parameter warning/error in parquet/page_data.cu (#13367) @davidwendt
  • Fix page size estimation in Parquet writer (#13364) @etseidl
  • Fix subword_tokenize error when input contains no tokens (#13320) @davidwendt
  • Support gcc 12 as the C++ compiler (#13316) @robertmaynard
  • Correctly set bitmask size in from_column_view (#13315) @wence-
  • Fix approach to detecting assignment for gte/lte operators (#13285) @vyasr
  • Fix parquet schema interpretation issue (#13277) @hyperbolic2346
  • Fix 64bit shift bug in avro reader (#13276) @karthikeyann
  • Fix unused variables/parameters in parquet/writer_impl.cu (#13263) @davidwendt
  • Clean up buffers in case AssertionError (#13262) @razajafri
  • Allow empty input table in ast compute_column (#13245) @wence-
  • Fix structs_column_wrapper constructors to copy input column wrappers (#13243) @davidwendt
  • Fix the row index stream order in ORC reader (#13242) @vuule
  • Make is_decompression_disabled and is_compression_disabled thread-safe (#13240) @vuule
  • Add [[maybe_unused]] to nvbench environment. (#13219) @bdice
  • Fix race in ORC string dictionary creation (#13214) @revans2
  • Add scalar argtypes to udf cache keys (#13194) @brandon-b-miller
  • Fix unused parameter warning/error in grouped_rolling.cu (#13192) @davidwendt
  • Avoid skbuild 0.17.2 which affected the cmake -DPython_LIBRARY string (#13188) @sevagh
  • Fix hostdevice_vector::subspan (#13187) @ttnghia
  • Use custom nvbench entry point to ensure cudf::nvbench_base_fixture usage (#13183) @robertmaynard
  • Fix slice_strings to return empty strings for stop < start indices (#13178) @davidwendt
  • Allow compilation with any GTest version 1.11+ (#13153) @robertmaynard
  • Fix a few clang-format style check errors (#13146) @davidwendt
  • [REVIEW] Fix Series and DataFrame constructors to validate index lengths (#13122) @galipremsagar
  • Fix hash join when the input tables have nulls on only one side (#13120) @ttnghia
  • Fix GPU_ARCHS setting in Java CMake build and CMAKE_CUDA_ARCHITECTURES in Python package build. (#13117) @davidwendt
  • Adds checks to make sure json reader won't overflow (#13115) @elstehle
  • Fix null_count of columns returned by chunked_parquet_reader (#13111) @vuule
  • Fixes sliced list and struct column bug in JSON chunked writer (#13108) @karthikeyann
  • [REVIEW] Fix missing confluent kafka version (#13101) @galipremsagar
  • Use make_empty_lists_column instead of make_empty_column(type_id::LIST) (#13099) @davidwendt
  • Raise NotImplementedError when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
  • Fix column selection read_parquet benchmarks (#13082) @vuule
  • Fix bugs in iterative groupby apply algorithm (#13078) @brandon-b-miller
  • Add algorithm include in data_sink.hpp (#13068) @ahendriksen
  • Fix tests/identify_stream_usage.cpp (#13066) @ahendriksen
  • Prevent overflow with skip_rows in ORC and Parquet readers (#13063) @vuule
  • Add except declaration in Cython interface for regex_program::create (#13054) @davidwendt
  • [REVIEW] Fix branch version in CI scripts (#13029) @galipremsagar
  • Fix OOB memory access in CSV reader when reading without NA values (#13011) @vuule
  • Fix read_avro() skip_rows and num_rows. (#12912) @tpn
  • Purge nonempty nulls from byte_cast list outputs. (#11971) @bdice
  • Fix consumption of CPU-backed interchange protocol dataframes (#11392) @shwina

🚀 New Features

  • Remove numba JIT kernel usage from dataframe copy tests (#13385) @brandon-b-miller
  • Add JNI for ORC/Parquet writer compression statistics (#13376) @ttnghia
  • Use _compile_or_get in JIT groupby apply (#13350) @brandon-b-miller
  • cuDF numba cuda 12 updates (#13337) @brandon-b-miller
  • Add tz_convert method to convert between timestamps (#13328) @shwina
  • Optionally return compression statistics from ORC and Parquet writers (#13294) @vuule
  • Support the case=False argument to str.contains (#13290) @shwina
  • Add an event handler for ColumnVector.close (#13279) @abellina
  • JNI api for cudf::chunked_pack (#13278) @abellina
  • Implement a chunked_pack API (#13260) @abellina
  • Update cudf recipes to use GTest version to >=1.13 (#13207) @robertmaynard
  • JNI changes for range-extents in window functions. (#13199) @mythrocks
  • Add support for DatetimeTZDtype and tz_localize (#13163) @shwina
  • Add IS_NULL operator to AST (#13145) @karthikeyann
  • STRING order-by column for RANGE window functions (#13143) @mythrocks
  • Update contains_table to experimental row hasher and equality comparator (#13119) @divyegala
  • Automatically select GroupBy.apply algorithm based on if the UDF is jittable (#13113) @brandon-b-miller
  • Refactor Parquet chunked writer (#13076) @ttnghia
  • Add Python bindings for string literal support in AST (#13073) @karthikeyann
  • Add Java bindings for string literal support in AST (#13072) @karthikeyann
  • Add string scalar support in AST (#13061) @karthikeyann
  • Log cuIO warnings using the libcudf logger (#13043) @vuule
  • Update mixed_join to use experimental row hasher and comparator (#13028) @divyegala
  • Support structs of lists in row lexicographic comparator (#13005) @ttnghia
  • Adding hostdevice_span that is a span createable from hostdevice_vector (#12981) @hyperbolic2346
  • Add nvtext::minhash function (#12961) @davidwendt
  • Support lists of structs in row lexicographic comparator (#12953) @ttnghia
  • Update join to use experimental row hasher and comparator (#12787) @divyegala
  • Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller

🛠️ Improvements

  • Bump typing_extensions minimum version to 4.0.0 (#13618) @shwina
  • Drop extraneous dependencies from cudf conda recipe. (#13406) @bdice
  • Handle some corner-cases in indexing with boolean masks (#13402) @wence-
  • Add cudf::stable_distinct public API, tests, and benchmarks. (#13392) @bdice
  • [JNI] Pass this ColumnVector to the onClosed event handler (#13386) @abellina
  • Fix JNI method with mismatched parameter list (#13384) @ttnghia
  • Split up experimental_row_operator_tests.cu to improve its compile time (#13382) @davidwendt
  • Deprecate cudf::strings::slice_strings APIs that accept delimiters (#13373) @davidwendt
  • Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
  • Move some nvtext benchmarks to nvbench (#13368) @davidwendt
  • run docs nightly too (#13366) @AyodeAwe
  • Add warning for default dtype parameter in get_dummies (#13365) @galipremsagar
  • Add log messages about kvikIO compatibility mode (#13363) @vuule
  • Switch back to using primary shared-action-workflows branch (#13362) @vyasr
  • Deprecate StringIndex and use Index instead (#13361) @galipremsagar
  • Ensure columns have valid null counts in CUDF JNI. (#13355) @mythrocks
  • Expunge most uses of TypeVar(bound="Foo") (#13346) @wence-
  • Remove all references to UNKNOWN_NULL_COUNT in Python (#13345) @vyasr
  • Improve distinct_count with cuco::static_set (#13343) @PointKernel
  • Fix contiguous_split performance (#13342) @ttnghia
  • Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
  • Update mypy to 1.3 (#13340) @wence-
  • [Java] Purge non-empty nulls when setting validity (#13335) @razajafri
  • Add row-wise filtering step to read_parquet (#13334) @rjzamora
  • Performance improvement for nvtext::minhash (#13333) @davidwendt
  • Fix some libcudf functions to set the null count on returning columns (#13331) @davidwendt
  • Change cudf::detail::concatenate_masks to return null-count (#13330) @davidwendt
  • Move meta calculation in dask_cudf.read_parquet (#13327) @rjzamora
  • Changes to support Numpy >= 1.24 (#13325) @shwina
  • Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
  • Clean up distinct_count benchmark (#13321) @PointKernel
  • Fix gtest pinning to 1.13.0. (#13319) @bdice
  • Remove null mask and null count from column_view constructors (#13311) @vyasr
  • Address feedback from 13289 (#13306) @vyasr
  • Change default value of the observed= argument in groupby to True to reflect the actual behaviour (#13296) @shwina
  • First check for BaseDtype when inferring the data type of an arbitrary object (#13295) @shwina
  • Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
  • Support CUDA 12.0 for pip wheels (#13289) @divyegala
  • Refactor transform_lists_of_structs in row_operators.cu (#13288) @ttnghia
  • Branch 23.06 merge 23.04 (#13286) @vyasr
  • Update cupy dependency (#13284) @vyasr
  • Performance improvement in cudf::strings::join_strings for long strings (#13283) @davidwendt
  • Fix unused variables and functions (#13275) @karthikeyann
  • Fix integer overflow in partition scatter_map construction (#13272) @wence-
  • Numba 0.57 compatibility fixes (#13271) @gmarkall
  • Performance improvement in cudf::strings::all_characters_of_type (#13259) @davidwendt
  • Remove default null-count parameter from some libcudf factory functions (#13258) @davidwendt
  • Roll our own generate_string() because mimesis' has gone away (#13257) @shwina
  • Build wheels using new single image workflow (#13249) @vyasr
  • Enable sccache hits from local builds (#13248) @AyodeAwe
  • Revert to branch-23.06 for shared-action-workflows (#13247) @shwina
  • Introduce pandas_compatible option in cudf (#13241) @galipremsagar
  • Add metadata_builder helper class (#13232) @abellina
  • Use libkvikio conda packages in libcudf, add explicit libcufile dependency. (#13231) @bdice
  • Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
  • Performance improvement in cudf::strings::find/rfind for long strings (#13226) @davidwendt
  • Add chunked reader benchmark (#13223) @SrikarVanavasam
  • Set the null count in output columns in the CSV reader (#13221) @vuule
  • Skip Non-Empty nulls tests for the nightly build just like we skip CuFileTest and CudaFatalTest (#13213) @razajafri
  • Fix string_scalar stream usage in write_json.cu (#13212) @davidwendt
  • Use canonicalized name for dlopen'd libraries (libcufile) (#13210) @shwina
  • Refactor pinned memory vector and ORC+Parquet writers (#13206) @ttnghia
  • Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
  • Optimization to decoding of parquet level streams (#13203) @nvdbaranec
  • Clean up and simplify gpuDecideCompression (#13202) @vuule
  • Use std::array for a statically sized vector in create_serialized_trie (#13201) @vuule
  • Update minimum Python version to Python 3.9 (#13196) @shwina
  • Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
  • Remove usage of rapids-get-rapids-version-from-git (#13184) @jjacobelli
  • Enable mixed-dtype decimal/scalar binary operations (#13171) @shwina
  • Split up unique_count.cu to improve build time (#13169) @davidwendt
  • Use nvtx3 includes in string examples. (#13165) @bdice
  • Change some .cu gtest files to .cpp (#13155) @davidwendt
  • Remove wheel pytest verbosity (#13151) @sevagh
  • Fix libcudf to always pass null-count to set_null_mask (#13149) @davidwendt
  • Fix gtests to always pass null-count to set_null_mask calls (#13148) @davidwendt
  • Optimize JSON writer (#13144) @karthikeyann
  • Performance improvement for libcudf upper/lower conversion for long strings (#13142) @davidwendt
  • [REVIEW] Deprecate pad and backfill methods (#13140) @galipremsagar
  • Use CTAD instead of functions in ProtobufReader (#13135) @vuule
  • Remove more instances of UNKNOWN_NULL_COUNT (#13134) @vyasr
  • Update clang-format to 16.0.1. (#13133) @bdice
  • Add log messages about cuIO's nvCOMP and cuFile use (#13132) @vuule
  • Branch 23.06 merge 23.04 (#13131) @vyasr
  • Compute null-count in cudf::detail::slice (#13124) @davidwendt
  • Use ARC V2 self-hosted runners for GPU jobs (#13123) @jjacobelli
  • Set null-count in linked_column_view conversion operator (#13121) @davidwendt
  • Adding ifdefs around nvcc-specific pragmas (#13110) @hyperbolic2346
  • Add null-count parameter to json experimental parse_data utility (#13107) @davidwendt
  • Remove uses-setup-env-vars (#13105) @vyasr
  • Explicitly compute null count in concatenate APIs (#13104) @vyasr
  • Replace unnecessary uses of UNKNOWN_NULL_COUNT (#13102) @vyasr
  • Performance improvement for cudf::string_view::find functions (#13100) @davidwendt
  • Use .element() instead of .data() for window range calculations (#13095) @mythrocks
  • Cleanup Parquet chunked writer (#13094) @ttnghia
  • Fix unused variable error/warning in page_data.cu (#13093) @davidwendt
  • Cleanup ORC chunked writer (#13091) @ttnghia
  • Remove using namespace cudf; from libcudf gtests source (#13089) @davidwendt
  • Change cudf::test::make_null_mask to also return null-count (#13081) @davidwendt
  • Resolved automerger from branch-23.04 to branch-23.06 (#13080) @galipremsagar
  • Assert for non-empty nulls (#13071) @razajafri
  • Remove deprecated regex functions from libcudf (#13067) @davidwendt
  • Refactor cudf::detail::sorted_order (#13062) @ttnghia
  • Improve performance of slice_strings for long strings (#13057) @davidwendt
  • Reduce shared memory usage in gpuComputePageSizes by 50% (#13047) @nvdbaranec
  • [REVIEW] Add notes to performance comparisons notebook (#13044) @galipremsagar
  • Enable binary operations between scalars and columns of differing decimal types (#13034) @shwina
  • Remove console output from some libcudf gtests (#13027) @davidwendt
  • Remove underscore in build string. (#13025) @bdice
  • Bump up JNI version 23.06.0-SNAPSHOT (#13021) @pxLi
  • Fix auto merger from branch-23.04 to branch-23.06 (#13009) @galipremsagar
  • Reduce peak memory use when writing compressed ORC files. (#12963) @vuule
  • Add nvtx annotations to groupby methods (#12941) @wence-
  • Compute column sizes in Parquet preprocess with single kernel (#12931) @SrikarVanavasam
  • Add Python bindings for time zone data (TZiF) reader (#12826) @shwina
  • Optimize set-like operations (#12769) @ttnghia
  • [REVIEW] Upgrade to arrow-11 (#12757) @galipremsagar
  • Add empty test files for test reorganization (#12288) @shwina

- C++
Published by raydouglass over 2 years ago

https://github.com/rapidsai/cudf - v23.06.00

🚨 Breaking Changes

  • Fix batch processing for parquet writer (#13438) @ttnghia
  • Use <NA> instead of null to match pandas. (#13415) @bdice
  • Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
  • Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
  • Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
  • Remove null mask and null count from column_view constructors (#13311) @vyasr
  • Change default value of the observed= argument in groupby to True to reflect the actual behaviour (#13296) @shwina
  • Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
  • Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
  • Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
  • Update minimum Python version to Python 3.9 (#13196) @shwina
  • Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
  • Cleanup Parquet chunked writer (#13094) @ttnghia
  • Cleanup ORC chunked writer (#13091) @ttnghia
  • Raise NotImplementedError when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
  • Remove deprecated regex functions from libcudf (#13067) @davidwendt
  • [REVIEW] Upgrade to arrow-11 (#12757) @galipremsagar
  • Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller

🐛 Bug Fixes

  • Fix valid count computation in offset_bitmask_binop kernel (#13489) @davidwendt
  • Fix writing of ORC files with empty rowgroups (#13466) @vuule
  • Fix cudf::repeat logic when count is zero (#13459) @davidwendt
  • Fix batch processing for parquet writer (#13438) @ttnghia
  • Fix invalid use of std::exclusive_scan in Parquet writer (#13434) @etseidl
  • Patch numba if it is imported first to ensure minor version compatibility works. (#13433) @bdice
  • Fix cudf::strings::replace_with_backrefs hang on empty match result (#13418) @davidwendt
  • Use <NA> instead of null to match pandas. (#13415) @bdice
  • Fix tokenize with non-space delimiter (#13403) @shwina
  • Fix groupby head/tail for empty dataframe (#13398) @shwina
  • Default to closed="right" in IntervalIndex constructor (#13394) @shwina
  • Correctly reorder and reindex scan groupbys with null keys (#13389) @wence-
  • Fix unused argument errors in nvcc 11.5 (#13387) @abellina
  • Updates needed to work with jitify that leverages libcudacxx (#13383) @robertmaynard
  • Fix unused parameter warning/error in parquet/page_data.cu (#13367) @davidwendt
  • Fix page size estimation in Parquet writer (#13364) @etseidl
  • Fix subword_tokenize error when input contains no tokens (#13320) @davidwendt
  • Support gcc 12 as the C++ compiler (#13316) @robertmaynard
  • Correctly set bitmask size in from_column_view (#13315) @wence-
  • Fix approach to detecting assignment for gte/lte operators (#13285) @vyasr
  • Fix parquet schema interpretation issue (#13277) @hyperbolic2346
  • Fix 64bit shift bug in avro reader (#13276) @karthikeyann
  • Fix unused variables/parameters in parquet/writer_impl.cu (#13263) @davidwendt
  • Clean up buffers in case AssertionError (#13262) @razajafri
  • Allow empty input table in ast compute_column (#13245) @wence-
  • Fix structs_column_wrapper constructors to copy input column wrappers (#13243) @davidwendt
  • Fix the row index stream order in ORC reader (#13242) @vuule
  • Make is_decompression_disabled and is_compression_disabled thread-safe (#13240) @vuule
  • Add [[maybe_unused]] to nvbench environment. (#13219) @bdice
  • Fix race in ORC string dictionary creation (#13214) @revans2
  • Add scalar argtypes to udf cache keys (#13194) @brandon-b-miller
  • Fix unused parameter warning/error in grouped_rolling.cu (#13192) @davidwendt
  • Avoid skbuild 0.17.2 which affected the cmake -DPython_LIBRARY string (#13188) @sevagh
  • Fix hostdevice_vector::subspan (#13187) @ttnghia
  • Use custom nvbench entry point to ensure cudf::nvbench_base_fixture usage (#13183) @robertmaynard
  • Fix slice_strings to return empty strings for stop < start indices (#13178) @davidwendt
  • Allow compilation with any GTest version 1.11+ (#13153) @robertmaynard
  • Fix a few clang-format style check errors (#13146) @davidwendt
  • [REVIEW] Fix Series and DataFrame constructors to validate index lengths (#13122) @galipremsagar
  • Fix hash join when the input tables have nulls on only one side (#13120) @ttnghia
  • Fix GPU_ARCHS setting in Java CMake build and CMAKE_CUDA_ARCHITECTURES in Python package build. (#13117) @davidwendt
  • Adds checks to make sure json reader won't overflow (#13115) @elstehle
  • Fix null_count of columns returned by chunked_parquet_reader (#13111) @vuule
  • Fixes sliced list and struct column bug in JSON chunked writer (#13108) @karthikeyann
  • [REVIEW] Fix missing confluent kafka version (#13101) @galipremsagar
  • Use make_empty_lists_column instead of make_empty_column(type_id::LIST) (#13099) @davidwendt
  • Raise NotImplementedError when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
  • Fix column selection read_parquet benchmarks (#13082) @vuule
  • Fix bugs in iterative groupby apply algorithm (#13078) @brandon-b-miller
  • Add algorithm include in data_sink.hpp (#13068) @ahendriksen
  • Fix tests/identify_stream_usage.cpp (#13066) @ahendriksen
  • Prevent overflow with skip_rows in ORC and Parquet readers (#13063) @vuule
  • Add except declaration in Cython interface for regex_program::create (#13054) @davidwendt
  • [REVIEW] Fix branch version in CI scripts (#13029) @galipremsagar
  • Fix OOB memory access in CSV reader when reading without NA values (#13011) @vuule
  • Fix read_avro() skip_rows and num_rows. (#12912) @tpn
  • Purge nonempty nulls from byte_cast list outputs. (#11971) @bdice
  • Fix consumption of CPU-backed interchange protocol dataframes (#11392) @shwina

🚀 New Features

  • Remove numba JIT kernel usage from dataframe copy tests (#13385) @brandon-b-miller
  • Add JNI for ORC/Parquet writer compression statistics (#13376) @ttnghia
  • Use _compile_or_get in JIT groupby apply (#13350) @brandon-b-miller
  • cuDF numba cuda 12 updates (#13337) @brandon-b-miller
  • Add tz_convert method to convert between timestamps (#13328) @shwina
  • Optionally return compression statistics from ORC and Parquet writers (#13294) @vuule
  • Support the case=False argument to str.contains (#13290) @shwina
  • Add an event handler for ColumnVector.close (#13279) @abellina
  • JNI api for cudf::chunked_pack (#13278) @abellina
  • Implement a chunked_pack API (#13260) @abellina
  • Update cudf recipes to use GTest version to >=1.13 (#13207) @robertmaynard
  • JNI changes for range-extents in window functions. (#13199) @mythrocks
  • Add support for DatetimeTZDtype and tz_localize (#13163) @shwina
  • Add IS_NULL operator to AST (#13145) @karthikeyann
  • STRING order-by column for RANGE window functions (#13143) @mythrocks
  • Update contains_table to experimental row hasher and equality comparator (#13119) @divyegala
  • Automatically select GroupBy.apply algorithm based on if the UDF is jittable (#13113) @brandon-b-miller
  • Refactor Parquet chunked writer (#13076) @ttnghia
  • Add Python bindings for string literal support in AST (#13073) @karthikeyann
  • Add Java bindings for string literal support in AST (#13072) @karthikeyann
  • Add string scalar support in AST (#13061) @karthikeyann
  • Log cuIO warnings using the libcudf logger (#13043) @vuule
  • Update mixed_join to use experimental row hasher and comparator (#13028) @divyegala
  • Support structs of lists in row lexicographic comparator (#13005) @ttnghia
  • Adding hostdevice_span that is a span createable from hostdevice_vector (#12981) @hyperbolic2346
  • Add nvtext::minhash function (#12961) @davidwendt
  • Support lists of structs in row lexicographic comparator (#12953) @ttnghia
  • Update join to use experimental row hasher and comparator (#12787) @divyegala
  • Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller

🛠️ Improvements

  • Drop extraneous dependencies from cudf conda recipe. (#13406) @bdice
  • Handle some corner-cases in indexing with boolean masks (#13402) @wence-
  • Add cudf::stable_distinct public API, tests, and benchmarks. (#13392) @bdice
  • [JNI] Pass this ColumnVector to the onClosed event handler (#13386) @abellina
  • Fix JNI method with mismatched parameter list (#13384) @ttnghia
  • Split up experimental_row_operator_tests.cu to improve its compile time (#13382) @davidwendt
  • Deprecate cudf::strings::slice_strings APIs that accept delimiters (#13373) @davidwendt
  • Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
  • Move some nvtext benchmarks to nvbench (#13368) @davidwendt
  • run docs nightly too (#13366) @AyodeAwe
  • Add warning for default dtype parameter in get_dummies (#13365) @galipremsagar
  • Add log messages about kvikIO compatibility mode (#13363) @vuule
  • Switch back to using primary shared-action-workflows branch (#13362) @vyasr
  • Deprecate StringIndex and use Index instead (#13361) @galipremsagar
  • Ensure columns have valid null counts in CUDF JNI. (#13355) @mythrocks
  • Expunge most uses of TypeVar(bound="Foo") (#13346) @wence-
  • Remove all references to UNKNOWN_NULL_COUNT in Python (#13345) @vyasr
  • Improve distinct_count with cuco::static_set (#13343) @PointKernel
  • Fix contiguous_split performance (#13342) @ttnghia
  • Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
  • Update mypy to 1.3 (#13340) @wence-
  • [Java] Purge non-empty nulls when setting validity (#13335) @razajafri
  • Add row-wise filtering step to read_parquet (#13334) @rjzamora
  • Performance improvement for nvtext::minhash (#13333) @davidwendt
  • Fix some libcudf functions to set the null count on returning columns (#13331) @davidwendt
  • Change cudf::detail::concatenate_masks to return null-count (#13330) @davidwendt
  • Move meta calculation in dask_cudf.read_parquet (#13327) @rjzamora
  • Changes to support Numpy >= 1.24 (#13325) @shwina
  • Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
  • Clean up distinct_count benchmark (#13321) @PointKernel
  • Fix gtest pinning to 1.13.0. (#13319) @bdice
  • Remove null mask and null count from column_view constructors (#13311) @vyasr
  • Address feedback from 13289 (#13306) @vyasr
  • Change default value of the observed= argument in groupby to True to reflect the actual behaviour (#13296) @shwina
  • First check for BaseDtype when inferring the data type of an arbitrary object (#13295) @shwina
  • Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
  • Support CUDA 12.0 for pip wheels (#13289) @divyegala
  • Refactor transform_lists_of_structs in row_operators.cu (#13288) @ttnghia
  • Branch 23.06 merge 23.04 (#13286) @vyasr
  • Update cupy dependency (#13284) @vyasr
  • Performance improvement in cudf::strings::join_strings for long strings (#13283) @davidwendt
  • Fix unused variables and functions (#13275) @karthikeyann
  • Fix integer overflow in partition scatter_map construction (#13272) @wence-
  • Numba 0.57 compatibility fixes (#13271) @gmarkall
  • Performance improvement in cudf::strings::all_characters_of_type (#13259) @davidwendt
  • Remove default null-count parameter from some libcudf factory functions (#13258) @davidwendt
  • Roll our own generate_string() because mimesis' has gone away (#13257) @shwina
  • Build wheels using new single image workflow (#13249) @vyasr
  • Enable sccache hits from local builds (#13248) @AyodeAwe
  • Revert to branch-23.06 for shared-action-workflows (#13247) @shwina
  • Introduce pandas_compatible option in cudf (#13241) @galipremsagar
  • Add metadata_builder helper class (#13232) @abellina
  • Use libkvikio conda packages in libcudf, add explicit libcufile dependency. (#13231) @bdice
  • Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
  • Performance improvement in cudf::strings::find/rfind for long strings (#13226) @davidwendt
  • Add chunked reader benchmark (#13223) @SrikarVanavasam
  • Set the null count in output columns in the CSV reader (#13221) @vuule
  • Skip Non-Empty nulls tests for the nightly build just like we skip CuFileTest and CudaFatalTest (#13213) @razajafri
  • Fix string_scalar stream usage in write_json.cu (#13212) @davidwendt
  • Use canonicalized name for dlopen'd libraries (libcufile) (#13210) @shwina
  • Refactor pinned memory vector and ORC+Parquet writers (#13206) @ttnghia
  • Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
  • Optimization to decoding of parquet level streams (#13203) @nvdbaranec
  • Clean up and simplify gpuDecideCompression (#13202) @vuule
  • Use std::array for a statically sized vector in create_serialized_trie (#13201) @vuule
  • Update minimum Python version to Python 3.9 (#13196) @shwina
  • Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
  • Remove usage of rapids-get-rapids-version-from-git (#13184) @jjacobelli
  • Enable mixed-dtype decimal/scalar binary operations (#13171) @shwina
  • Split up unique_count.cu to improve build time (#13169) @davidwendt
  • Use nvtx3 includes in string examples. (#13165) @bdice
  • Change some .cu gtest files to .cpp (#13155) @davidwendt
  • Remove wheel pytest verbosity (#13151) @sevagh
  • Fix libcudf to always pass null-count to set_null_mask (#13149) @davidwendt
  • Fix gtests to always pass null-count to set_null_mask calls (#13148) @davidwendt
  • Optimize JSON writer (#13144) @karthikeyann
  • Performance improvement for libcudf upper/lower conversion for long strings (#13142) @davidwendt
  • [REVIEW] Deprecate pad and backfill methods (#13140) @galipremsagar
  • Use CTAD instead of functions in ProtobufReader (#13135) @vuule
  • Remove more instances of UNKNOWN_NULL_COUNT (#13134) @vyasr
  • Update clang-format to 16.0.1. (#13133) @bdice
  • Add log messages about cuIO's nvCOMP and cuFile use (#13132) @vuule
  • Branch 23.06 merge 23.04 (#13131) @vyasr
  • Compute null-count in cudf::detail::slice (#13124) @davidwendt
  • Use ARC V2 self-hosted runners for GPU jobs (#13123) @jjacobelli
  • Set null-count in linked_column_view conversion operator (#13121) @davidwendt
  • Adding ifdefs around nvcc-specific pragmas (#13110) @hyperbolic2346
  • Add null-count parameter to json experimental parse_data utility (#13107) @davidwendt
  • Remove uses-setup-env-vars (#13105) @vyasr
  • Explicitly compute null count in concatenate APIs (#13104) @vyasr
  • Replace unnecessary uses of UNKNOWN_NULL_COUNT (#13102) @vyasr
  • Performance improvement for cudf::string_view::find functions (#13100) @davidwendt
  • Use .element() instead of .data() for window range calculations (#13095) @mythrocks
  • Cleanup Parquet chunked writer (#13094) @ttnghia
  • Fix unused variable error/warning in page_data.cu (#13093) @davidwendt
  • Cleanup ORC chunked writer (#13091) @ttnghia
  • Remove using namespace cudf; from libcudf gtests source (#13089) @davidwendt
  • Change cudf::test::make_null_mask to also return null-count (#13081) @davidwendt
  • Resolved automerger from branch-23.04 to branch-23.06 (#13080) @galipremsagar
  • Assert for non-empty nulls (#13071) @razajafri
  • Remove deprecated regex functions from libcudf (#13067) @davidwendt
  • Refactor cudf::detail::sorted_order (#13062) @ttnghia
  • Improve performance of slice_strings for long strings (#13057) @davidwendt
  • Reduce shared memory usage in gpuComputePageSizes by 50% (#13047) @nvdbaranec
  • [REVIEW] Add notes to performance comparisons notebook (#13044) @galipremsagar
  • Enable binary operations between scalars and columns of differing decimal types (#13034) @shwina
  • Remove console output from some libcudf gtests (#13027) @davidwendt
  • Remove underscore in build string. (#13025) @bdice
  • Bump up JNI version 23.06.0-SNAPSHOT (#13021) @pxLi
  • Fix auto merger from branch-23.04 to branch-23.06 (#13009) @galipremsagar
  • Reduce peak memory use when writing compressed ORC files. (#12963) @vuule
  • Add nvtx annotations to groupby methods (#12941) @wence-
  • Compute column sizes in Parquet preprocess with single kernel (#12931) @SrikarVanavasam
  • Add Python bindings for time zone data (TZiF) reader (#12826) @shwina
  • Optimize set-like operations (#12769) @ttnghia
  • [REVIEW] Upgrade to arrow-11 (#12757) @galipremsagar
  • Add empty test files for test reorganization (#12288) @shwina

- C++
Published by raydouglass over 2 years ago

https://github.com/rapidsai/cudf - v23.04.00

🚨 Breaking Changes

  • Pin dask and distributed for release (#13070) @galipremsagar
  • Declare a different name for nan_equality.UNEQUAL to prevent Cython warnings. (#12947) @bdice
  • Update minimum pandas and numpy pinnings (#12887) @galipremsagar
  • Deprecate names & dtype in Index.copy (#12825) @galipremsagar
  • Deprecate Index.is_* methods (#12820) @galipremsagar
  • Deprecate datetime_is_numeric from describe (#12818) @galipremsagar
  • Deprecate na_sentinel in factorize (#12817) @galipremsagar
  • Make string methods return a Series with a useful Index (#12814) @shwina
  • Produce useful guidance on overflow error in to_csv (#12705) @wence-
  • Move strings_udf code into cuDF (#12669) @brandon-b-miller
  • Remove cudf::strings::repeat_strings_output_sizes and optional parameter from cudf::strings::repeat_strings (#12609) @davidwendt
  • Replace message parsing with throwing more specific exceptions (#12426) @vyasr

🐛 Bug Fixes

  • Fix memcheck script to execute only _TEST files found in bin/gtests/libcudf (#13006) @davidwendt
  • Fix DataFrame constructor to broadcast scalar inputs properly (#12997) @galipremsagar
  • Drop force_nullable_schema from chunked parquet writer (#12996) @galipremsagar
  • Fix gtest column utility comparator diff reporting (#12995) @davidwendt
  • Handle index names while performing groupby (#12992) @galipremsagar
  • Fix __setitem__ on string columns when the scalar value ends in a null byte (#12991) @wence-
  • Fix sort_values when column is all empty strings (#12988) @eriknw
  • Remove unused variable and fix memory issue in ORC writer (#12984) @ttnghia
  • Pre-emptive fix for upstream dask.dataframe.read_parquet changes (#12983) @rjzamora
  • Remove MANIFEST.in use auto-generated one for sdists and package_data for wheels (#12960) @vyasr
  • Update to use rapids-export(COMPONENTS) feature. (#12959) @robertmaynard
  • cudftestutil supports static gtest dependencies (#12957) @robertmaynard
  • Include gtest in build environment. (#12956) @vyasr
  • Correctly handle scalar indices in Index.__getitem__ (#12955) @wence-
  • Avoid building cython twice (#12945) @galipremsagar
  • Fix set index error for Series rolling window operations (#12942) @galipremsagar
  • Fix calculation of null counts for Parquet statistics (#12938) @etseidl
  • Preserve integer dtype of hive-partitioned column containing nulls (#12930) @rjzamora
  • Use get_current_device_resource for intermediate allocations in COLLECT_LIST window code (#12927) @karthikeyann
  • Mark dlpack tensor deleter as noexcept to match PyCapsule_Destructor signature. (#12921) @bdice
  • Fix conda recipe post-link.sh typo (#12916) @pentschev
  • min_rows and num_rows are swapped in ComputePageSizes declaration in Parquet reader (#12886) @etseidl
  • Expect cupy to now support bool arrays for dlpack. (#12883) @vyasr
  • Use python -m pytest for nightly wheel tests (#12871) @bdice
  • Parquet writer column_size() should return a size_t (#12870) @etseidl
  • Fix cudf::hash_partition kernel launch error with decimal128 types (#12863) @davidwendt
  • Fix an issue with parquet chunked reader undercounting string lengths. (#12859) @nvdbaranec
  • Remove tokenizers pre-install pinning. (#12854) @vyasr
  • Fix parquet RangeIndex bug (#12838) @rjzamora
  • Remove KAFKA_HOST_TEST from compute-sanitizer check (#12831) @davidwendt
  • Make string methods return a Series with a useful Index (#12814) @shwina
  • Tell cudf_kafka to use header-only fmt (#12796) @vyasr
  • Add GroupBy.dtypes (#12783) @galipremsagar
  • Fix a leak in a test and clarify some test names (#12781) @revans2
  • Fix bug in all-null list due to join_list_elements special handling (#12767) @karthikeyann
  • Add try/except for expected null-schema error in read_parquet (#12756) @rjzamora
  • Throw an exception if an unsupported page encoding is detected in Parquet reader (#12754) @etseidl
  • Fix a bug with num_keys in _scatter_by_slice (#12749) @thomcom
  • Bump pinned rapids wheel deps to 23.4 (#12735) @sevagh
  • Rework logic in cudf::strings::split_record to improve performance (#12729) @davidwendt
  • Add always_nullable flag to Dremel encoding (#12727) @divyegala
  • Fix memcheck read error in compound segmented reduce (#12722) @davidwendt
  • Fix faulty conditional logic in JIT GroupBy.apply (#12706) @brandon-b-miller
  • Produce useful guidance on overflow error in to_csv (#12705) @wence-
  • Handle parquet list data corner case (#12698) @nvdbaranec
  • Fix missing trailing comma in json writer (#12688) @karthikeyann
  • Remove child from newCudaAsyncMemoryResource (#12681) @abellina
  • Handle bool types in round API (#12670) @galipremsagar
  • Ensure all of device bitmask is initialized in from_arrow (#12668) @wence-
  • Fix from_arrow to load a sliced arrow table (#12665) @galipremsagar
  • Fix dask-cudf read_parquet bug for multi-file aggregation (#12663) @rjzamora
  • Fix AllocateLikeTest gtests reading uninitialized null-mask (#12643) @davidwendt
  • Fix find_common_dtype and values to handle complex dtypes (#12537) @galipremsagar
  • Fix fetching of MultiIndex values when a label is passed (#12521) @galipremsagar
  • Fix Series comparison vs scalars (#12519) @brandon-b-miller
  • Allow casting from UDFString back to StringView to call methods in strings_udf (#12363) @brandon-b-miller

📖 Documentation

  • Fix GroupBy.apply doc examples rendering (#12994) @brandon-b-miller
  • add sphinx building and s3 uploading for dask-cudf docs (#12982) @quasiben
  • Add developer documentation forbidding default parameters in detail APIs (#12978) @vyasr
  • Add README symlink for dask-cudf. (#12946) @bdice
  • Remove return type from @return doxygen tags (#12908) @davidwendt
  • Fix docs build to be pydata-sphinx-theme=0.13.0 compatible (#12874) @galipremsagar
  • Add skeleton API and prose documentation for dask-cudf (#12725) @wence-
  • Enable doctests for GroupBy methods (#12658) @brandon-b-miller
  • Add comment about CUB patch for SegmentedSortInt.Bool gtest (#12611) @davidwendt

🚀 New Features

  • Add JNI method for strings::replace multi variety (#12979) @NVnavkumar
  • Add nunique aggregation support for cudf::segmented_reduce (#12972) @davidwendt
  • Refactor orc chunked writer (#12949) @ttnghia
  • Make Parquet writer nullable option applicable to single table writes (#12933) @vuule
  • Refactor io::orc::ProtobufWriter (#12877) @ttnghia
  • Make timezone table independent from ORC (#12805) @vuule
  • Cache JIT GroupBy.apply functions (#12802) @brandon-b-miller
  • Implement initial support for avro logical types (#6482) (#12788) @tpn
  • Update tests/column_utilities to use experimental::equality row comparator (#12777) @divyegala
  • Update distinct/unique_count to experimental::row hasher/comparator (#12776) @divyegala
  • Update hash_partition to use experimental::row::row_hasher (#12761) @divyegala
  • Update is_sorted to use experimental::row::lexicographic (#12752) @divyegala
  • Update default data source in cuio reader benchmarks (#12740) @PointKernel
  • Reenable stream identification library in CI (#12714) @vyasr
  • Add regex_program strings splitting java APIs and tests (#12713) @cindyyuanjiang
  • Add regex_program strings replacing java APIs and tests (#12701) @cindyyuanjiang
  • Add regex_program strings extract java APIs and tests (#12699) @cindyyuanjiang
  • Variable fragment sizes for Parquet writer (#12685) @etseidl
  • Add segmented reduction support for fixed-point types (#12680) @davidwendt
  • Move strings_udf code into cuDF (#12669) @brandon-b-miller
  • Add regex_program searching APIs and related java classes (#12666) @cindyyuanjiang
  • Add logging to libcudf (#12637) @vuule
  • Add compound aggregations to cudf::segmented_reduce (#12573) @davidwendt
  • Convert rank to use experimental row comparators (#12481) @divyegala
  • Use rapids-cmake parallel testing feature (#12451) @robertmaynard
  • Enable detection of undesired stream usage (#12089) @vyasr

🛠️ Improvements

  • Pin dask and distributed for release (#13070) @galipremsagar
  • Pin cupy in wheel tests to supported versions (#13041) @vyasr
  • Pin numba version (#13001) @vyasr
  • Rework gtests SequenceTest to remove using namespace cudf (#12985) @davidwendt
  • Stop setting package version attribute in wheels (#12977) @vyasr
  • Move detail reduction functions to cudf::reduction::detail namespace (#12971) @davidwendt
  • Remove default detail mrs: part7 (#12970) @vyasr
  • Remove default detail mrs: part6 (#12969) @vyasr
  • Remove default detail mrs: part5 (#12968) @vyasr
  • Remove default detail mrs: part4 (#12967) @vyasr
  • Remove default detail mrs: part3 (#12966) @vyasr
  • Remove default detail mrs: part2 (#12965) @vyasr
  • Remove default detail mrs: part1 (#12964) @vyasr
  • Add force_nullable_schema parameter to Parquet writer. (#12952) @galipremsagar
  • Declare a different name for nan_equality.UNEQUAL to prevent Cython warnings. (#12947) @bdice
  • Remove remaining default stream parameters (#12943) @vyasr
  • Fix cudf::segmented_reduce gtest for ANY aggregation (#12940) @davidwendt
  • Implement groupby.head and groupby.tail (#12939) @wence-
  • Fix libcudf gtests to pass null-count=0 for empty validity masks (#12923) @davidwendt
  • Migrate parquet encoding to use experimental row operators (#12918) @PointKernel
  • Fix benchmarks coded in namespace cudf and using namespace cudf (#12915) @karthikeyann
  • Fix io/text gtests coded in namespace cudf::test (#12914) @karthikeyann
  • Pass SCCACHE_S3_USE_SSL to conda builds (#12910) @ajschmidt8
  • Fix FST, JSON gtests & benchmarks coded in namespace cudf::test (#12907) @karthikeyann
  • Generate pyproject dependencies using dfg (#12906) @vyasr
  • Update libcudf counting functions to specify cudf::size_type (#12904) @davidwendt
  • Fix moto env vars & pass AWS_SESSION_TOKEN to conda builds (#12902) @ajschmidt8
  • Rewrite CSV writer benchmark with nvbench (#12901) @PointKernel
  • Rework some code logic to reduce iterator and comparator inlining to improve compile time (#12900) @davidwendt
  • Deprecate line_terminator in favor of lineterminator in to_csv (#12896) @wence-
  • Add stream and mr parameters for structs::detail::flatten_nested_columns (#12892) @ttnghia
  • Deprecate libcudf regex APIs accepting pattern strings directly (#12891) @davidwendt
  • Remove default parameters from detail headers in include (#12888) @vyasr
  • Update minimum pandas and numpy pinnings (#12887) @galipremsagar
  • Implement groupby.sample (#12882) @wence-
  • Update JNI build ENV default to gcc 11 (#12881) @pxLi
  • Change return type of cudf::structs::detail::flatten_nested_columns to smart pointer (#12878) @ttnghia
  • Fix passing seed parameter to MurmurHash3_32 in cudf::hash() function (#12875) @davidwendt
  • Remove manual artifact upload step in CI (#12869) @ajschmidt8
  • Update to GCC 11 (#12868) @bdice
  • Fix null hive-partition behavior in dask-cudf parquet (#12866) @rjzamora
  • Update to protobuf>=4.21.6,<4.22. (#12864) @bdice
  • Update RMM allocators (#12861) @pentschev
  • Improve performance for replace-multi for long strings (#12858) @davidwendt
  • Drop Python 3.7 handling for pickle protocol 4 (#12857) @jakirkham
  • Migrate as much as possible to pyproject.toml (#12850) @vyasr
  • Enable nbqa pre-commit hooks for isort and black. (#12848) @bdice
  • Setting a threshold for KvikIO IO (#12841) @madsbk
  • Update datasets download URL (#12840) @jjacobelli
  • Make docs builds less verbose (#12836) @AyodeAwe
  • Consolidate linter configs into pyproject.toml (#12834) @vyasr
  • Deprecate names & dtype in Index.copy (#12825) @galipremsagar
  • Deprecate inplace parameters in categorical methods (#12824) @galipremsagar
  • Add optional text file support to ninja-log utility (#12823) @davidwendt
  • Deprecate Index.is_* methods (#12820) @galipremsagar
  • Add dfg as a pre-commit hook (#12819) @vyasr
  • Deprecate datetime_is_numeric from describe (#12818) @galipremsagar
  • Deprecate na_sentinel in factorize (#12817) @galipremsagar
  • Shuffling read into a sub function in parquet read (#12809) @hyperbolic2346
  • Fixing parquet coalescing of reads (#12808) @hyperbolic2346
  • CI: Remove specification of manual stage for check_style.sh script. (#12803) @csadorf
  • Add compute-sanitizer github workflow action to nightly tests (#12800) @davidwendt
  • Enable groupby std and variance aggregation types in libcudf Debug build (#12799) @davidwendt
  • Expose seed argument to hash_values (#12795) @ayushdg
  • Fix groupby gtests coded in namespace cudf::test (#12784) @davidwendt
  • Improve performance for cudf::strings::count_characters for long strings (#12779) @davidwendt
  • Deallocate encoded data in ORC writer immediately after compression (#12770) @vuule
  • Stop force pulling fmt in nvbench. (#12768) @vyasr
  • Remove now redundant cuda initialization (#12758) @vyasr
  • Adds JSON reader, writer io benchmark (#12753) @karthikeyann
  • Use test paths relative to package directory. (#12751) @bdice
  • Add build metrics report as artifact to cpp-build workflow (#12750) @davidwendt
  • Add JNI methods for detecting and purging non-empty nulls from LIST and STRUCT (#12742) @razajafri
  • Stop using versioneer to manage versions (#12741) @vyasr
  • Reduce error handling verbosity in CI tests scripts (#12738) @AjayThorve
  • Reduce the number of test cases in multibyte_split benchmark (#12737) @PointKernel
  • Update shared workflow branches (#12733) @ajschmidt8
  • JNI switches to nested JSON reader (#12732) @res-life
  • Changing cudf::io::source_info to use cudf::host_span<std::byte> in a non-breaking form (#12730) @hyperbolic2346
  • Add nvbench environment class for initializing RMM in benchmarks (#12728) @davidwendt
  • Split C++ and Python build dependencies into separate lists. (#12724) @bdice
  • Add build dependencies to Java tests. (#12723) @bdice
  • Allow setting the seed argument for hash partition (#12715) @firestarman
  • Remove gpuCI scripts. (#12712) @bdice
  • Unpin dask and distributed for development (#12710) @galipremsagar
  • partition_by_hash(): use _split() (#12704) @madsbk
  • Remove DataFrame.quantiles from docs. (#12684) @bdice
  • Fast path for experimental::row::equality (#12676) @divyegala
  • Move date to build string in conda recipe (#12661) @ajschmidt8
  • Refactor reduction logic for fixed-point types (#12652) @davidwendt
  • Pay off some JNI RMM API tech debt (#12632) @revans2
  • Merge copy-on-write feature branch into branch-23.04 (#12619) @galipremsagar
  • Remove cudf::strings::repeat_strings_output_sizes and optional parameter from cudf::strings::repeat_strings (#12609) @davidwendt
  • Pin cuda-nvrtc. (#12606) @bdice
  • Remove cudf::test::print calls from libcudf gtests (#12604) @davidwendt
  • Init JNI version 23.04.0-SNAPSHOT (#12599) @pxLi
  • Add performance benchmarks to user facing docs (#12595) @galipremsagar
  • Add docs build job (#12592) @AyodeAwe
  • Replace message parsing with throwing more specific exceptions (#12426) @vyasr
  • Support conversion to/from cudf in dask.dataframe.core.to_backend (#12380) @rjzamora

- C++
Published by raydouglass almost 3 years ago

https://github.com/rapidsai/cudf - [NIGHTLY] v23.02.00

🔗 Links

🚨 Breaking Changes

  • Pin dask and distributed for release (#12695) @galipremsagar
  • Change ways to access ptr in Buffer (#12587) @galipremsagar
  • Remove column names (#12578) @vuule
  • Default cudf::io::read_json to nested JSON parser (#12544) @vuule
  • Switch engine=cudf to the new JSON reader (#12509) @galipremsagar
  • Add trailing comma support for nested JSON reader (#12448) @karthikeyann
  • Upgrade to arrow-10.0.1 (#12327) @galipremsagar
  • Fail loudly to avoid data corruption with unsupported input in read_orc (#12325) @vuule
  • CSV, JSON reader to infer integer column with nulls as int64 instead of float64 (#12309) @karthikeyann
  • Remove deprecated code for 23.02 (#12281) @vyasr
  • Null element for parsing error in numeric types in JSON, CSV reader (#12272) @karthikeyann
  • Purge non-empty nulls for superimpose_nulls and push_down_nulls (#12239) @ttnghia
  • Rename cudf::structs::detail::superimpose_parent_nulls APIs (#12230) @ttnghia
  • Remove JIT type names, refactor id_to_type. (#12158) @bdice
  • Floor division uses integer division for integral arguments (#12131) @wence-

🐛 Bug Fixes

  • Fix update-version.sh (#12745) @raydouglass
  • Fix a mask data corruption in UDF (#12647) @galipremsagar
  • pre-commit: Update isort version to 5.12.0 (#12645) @wence-
  • tests: Skip cuInit tests if cuda-gdb is not found or not working (#12644) @wence-
  • Revert regex program java APIs and tests (#12639) @cindyyuanjiang
  • Fix leaks in ColumnVectorTest (#12625) @jlowe
  • Handle when spillable buffers own each other (#12607) @madsbk
  • Fix incorrect null counts for sliced columns in JCudfSerialization (#12589) @jlowe
  • lists: Transfer dtypes correctly through list.get (#12586) @wence-
  • timedelta: Don't go via float intermediates for floordiv (#12585) @wence-
  • Fixing BUG, get_next_chunk() should use the blocking function device_read() (#12584) @madsbk
  • Make JNI QuoteStyle accessible outside ai.rapids.cudf (#12572) @mythrocks
  • partition_by_hash(): support index (#12554) @madsbk
  • Mixed Join benchmark bug due to wrong conditional column (#12553) @divyegala
  • Update List Lexicographical Comparator (#12538) @divyegala
  • Dynamically read PTX version (#12534) @brandon-b-miller
  • build.sh switch to use RAPIDS magic value (#12525) @robertmaynard
  • Loosen runtime arrow pinning (#12522) @vyasr
  • Enable metadata transfer for complex types in transpose (#12491) @galipremsagar
  • Fix issues with parquet chunked reader (#12488) @nvdbaranec
  • Fix missing metadata transfer in concat for ListColumn (#12487) @galipremsagar
  • Rename libcudf substring source files to slice (#12484) @davidwendt
  • Fix compile issue with arrow 10 (#12465) @ttnghia
  • Fix List offsets bug in mixed type list column in nested JSON reader (#12447) @karthikeyann
  • Fix xfail incompatibilities (#12423) @vyasr
  • Fix bug in Parquet column index encoding (#12404) @etseidl
  • When building Arrow shared look for a shared OpenSSL (#12396) @robertmaynard
  • Fix get_json_object to return empty column on empty input (#12384) @davidwendt
  • Pin arrow 9 in testing dependencies to prevent conda solve issues (#12377) @vyasr
  • Fix reductions any/all return value for empty input (#12374) @davidwendt
  • Fix debug compile errors in parquet.hpp (#12372) @davidwendt
  • Purge non-empty nulls in cudf::make_lists_column (#12370) @ttnghia
  • Use correct memory resource in io::make_column (#12364) @vyasr
  • Add code to detect possible malformed page data in parquet files. (#12360) @nvdbaranec
  • Fail loudly to avoid data corruption with unsupported input in read_orc (#12325) @vuule
  • Fix NumericPairIteratorTest for float values (#12306) @davidwendt
  • Fixes memory allocation in nested JSON tokenizer (#12300) @elstehle
  • Reconstruct dtypes correctly for list aggs of struct columns (#12290) @wence-
  • Fix regex \A and \Z to strictly match string begin/end (#12282) @davidwendt
  • Fix compile issue in json_chunked_reader.cpp (#12280) @ttnghia
  • Change reductions any/all to return valid values for empty input (#12279) @davidwendt
  • Only exclude join keys that are indices from key columns (#12271) @wence-
  • Fix spill to device limit (#12252) @madsbk
  • Correct behaviour of sort in concat for singleton concatenations (#12247) @wence-
  • Purge non-empty nulls for superimpose_nulls and push_down_nulls (#12239) @ttnghia
  • Patch CUB DeviceSegmentedSort and remove workaround (#12234) @davidwendt
  • Fix memory leak in udf_string::assign(&&) function (#12206) @davidwendt
  • Workaround thrust-copy-if limit in json get_tree_representation (#12190) @davidwendt
  • Fix page size calculation in Parquet writer (#12182) @etseidl
  • Add cudf::detail::sizes_to_offsets_iterator to allow checking overflow in offsets (#12180) @davidwendt
  • Workaround thrust-copy-if limit in wordpiece-tokenizer (#12168) @davidwendt
  • Floor division uses integer division for integral arguments (#12131) @wence-

📖 Documentation

  • Fix link to NVTX (#12598) @sameerz
  • Include missing groupby functions in documentation (#12580) @quasiben
  • Fix documentation author (#12527) @bdice
  • Update libcudf reduction docs for casting output types (#12526) @davidwendt
  • Add JSON reader page in user guide (#12499) @GregoryKimball
  • Link unsupported iteration API docstrings (#12482) @galipremsagar
  • strings_udf doc update (#12469) @brandon-b-miller
  • Update cudf_assert docs with correct NDEBUG behavior (#12464) @robertmaynard
  • Update pre-commit hooks guide (#12395) @bdice
  • Update test docs to not use detail comparison utilities (#12332) @PointKernel
  • Fix doxygen description for regex_program::compute_working_memory_size (#12329) @davidwendt
  • Add eval to docs. (#12322) @vyasr
  • Turn on xfail_strict=true (#12244) @wence-
  • Update 10 minutes to cuDF (#12114) @wence-

🚀 New Features

  • Use kvikIO as the default IO backend (#12574) @vuule
  • Use has_nonempty_nulls instead of may_contain_non_empty_nulls in superimpose_nulls and push_down_nulls (#12560) @ttnghia
  • Add strings methods removeprefix and removesuffix (#12557) @davidwendt
  • Add regex_program java APIs and unit tests (#12548) @cindyyuanjiang
  • Default cudf::io::read_json to nested JSON parser (#12544) @vuule
  • Make string quoting optional on CSV write (#12539) @mythrocks
  • Use new nvCOMP API to optimize the compression temp memory size (#12533) @vuule
  • Support "values" orient (array of arrays) in Nested JSON reader (#12498) @karthikeyann
  • one_hot_encode to use experimental row comparators (#12478) @divyegala
  • Support %W and %w format specifiers in cudf::strings::to_timestamps (#12475) @davidwendt
  • Add JSON Writer (#12474) @karthikeyann
  • Refactor thrust_copy_if into cudf::detail::copy_if_safe (#12455) @ttnghia
  • Add trailing comma support for nested JSON reader (#12448) @karthikeyann
  • Extract tokenize_json.hpp detail header from src/io/json/nested_json.hpp (#12432) @ttnghia
  • JNI bindings to write CSV (#12425) @mythrocks
  • Nested JSON depth benchmark (#12371) @karthikeyann
  • Implement lists::reverse (#12336) @ttnghia
  • Use device_read in experimental read_json (#12314) @vuule
  • Implement JNI for strings::reverse (#12283) @ttnghia
  • Null element for parsing error in numeric types in JSON, CSV reader (#12272) @karthikeyann
  • Add cudf::strings::like function with multiple patterns (#12269) @davidwendt
  • Add environment variable to control host memory allocation in hostdevice_vector (#12251) @vuule
  • Add cudf::strings::reverse function (#12227) @davidwendt
  • Selectively use dictionary encoding in Parquet writer (#12211) @etseidl
  • Support replace in strings_udf (#12207) @brandon-b-miller
  • Add support to read binary encoded decimals in parquet (#12205) @PointKernel
  • Support regex EOL where the string ends with a new-line character (#12181) @davidwendt
  • Updating stream_compaction/unique to use new row comparators (#12159) @divyegala
  • Add device buffer datasource (#12024) @PointKernel
  • Implement groupby apply with JIT (#11452) @bwyogatama

🛠️ Improvements

  • Update shared workflow branches (#12696) @ajschmidt8
  • Pin dask and distributed for release (#12695) @galipremsagar
  • Don't upload libcudf-example to Anaconda.org (#12671) @ajschmidt8
  • Pin wheel dependencies to same RAPIDS release (#12659) @sevagh
  • Use CTK 118/cp310 branch of wheel workflows (#12602) @sevagh
  • Change ways to access ptr in Buffer (#12587) @galipremsagar
  • Version a parquet writer xfail (#12579) @galipremsagar
  • Remove column names (#12578) @vuule
  • Parquet reader optimization to address V100 regression. (#12577) @nvdbaranec
  • Add support for category dtypes in CSV reader (#12571) @galipremsagar
  • Remove spill_lock parameter from SpillableBuffer.get_ptr() (#12564) @madsbk
  • Optimize cudf::make_lists_column (#12547) @ttnghia
  • Remove cudf::strings::repeat_strings_output_sizes from Java and JNI (#12546) @ttnghia
  • Test that cuInit is not called when RAPIDS_NO_INITIALIZE is set (#12545) @wence-
  • Rework repeat_strings to use sizes-to-offsets utility (#12543) @davidwendt
  • Replace exclusive_scan with sizes_to_offsets in cudf::lists::sequences (#12541) @davidwendt
  • Rework nvtext::ngrams_tokenize to use sizes-to-offsets utility (#12540) @davidwendt
  • Fix binary-ops gtests coded in namespace cudf::test (#12536) @davidwendt
  • More @acquire_spill_lock() and as_buffer(..., exposed=False) (#12535) @madsbk
  • Guard CUDA runtime APIs with error checking (#12531) @PointKernel
  • Update TODOs from issue 10432. (#12528) @bdice
  • Update rapids-cmake definitions version in GitHub Actions style checks. (#12511) @bdice
  • Switch engine=cudf to the new JSON reader (#12509) @galipremsagar
  • Fix SUM/MEAN aggregation type support. (#12503) @bdice
  • Stop using pandas._testing (#12492) @vyasr
  • Fix ROLLING_TEST gtests coded in namespace cudf::test (#12490) @davidwendt
  • Fix erroneously skipped ORC ZSTD test (#12486) @vuule
  • Rework nvtext::generate_character_ngrams to use make_strings_children (#12480) @davidwendt
  • Raise warnings as errors in the test suite (#12468) @vyasr
  • Remove int32 hard-coding in python (#12467) @galipremsagar
  • Use cudaMemcpyDefault. (#12466) @bdice
  • Update workflows for nightly tests (#12462) @ajschmidt8
  • Build CUDA 11.8 and Python 3.10 Packages (#12457) @ajschmidt8
  • JNI build image default as cuda11.8 (#12441) @pxLi
  • Re-enable Recently Updated Check (#12435) @ajschmidt8
  • Rework remaining cudf::strings::from_xyz functions to use make_strings_children (#12434) @vuule
  • Build wheels alongside conda CI (#12427) @sevagh
  • Remove arguments for checking exception messages in Python (#12424) @vyasr
  • Clean up cuco usage (#12421) @PointKernel
  • Fix warnings in remaining modules (#12406) @vyasr
  • Update ops-bot.yaml (#12402) @ajschmidt8
  • Rework cudf::strings::integers_to_ipv4 to use make_strings_children utility (#12401) @davidwendt
  • Use numpy.empty() instead of bytearray to allocate host memory for spilling (#12399) @madsbk
  • Deprecate chunksize from dask_cudf.read_csv (#12394) @rjzamora
  • Expose the RMM pool size in JNI (#12390) @revans2
  • Fix COPYING_TEST: gtests coded in namespace cudf::test (#12387) @davidwendt
  • Rework cudf::strings::url_encode to use make_strings_children utility (#12385) @davidwendt
  • Use make_strings_children in parse_data nested json reader (#12382) @karthikeyann
  • Fix warnings in test_datetime.py (#12381) @vyasr
  • Mixed Join Benchmarks (#12375) @divyegala
  • Fix warnings in dataframe.py (#12369) @vyasr
  • Update conda recipes. (#12368) @bdice
  • Use gpu-latest-1 runner tag (#12366) @bdice
  • Rework cudf::strings::from_booleans to use make_strings_children (#12365) @vuule
  • Fix warnings in test modules up to test_dataframe.py (#12355) @vyasr
  • JSON column performance optimization - struct column nulls (#12354) @karthikeyann
  • Accelerate stable-segmented-sort with CUB segmented sort (#12347) @davidwendt
  • Add size check to make_offsets_child_column utility (#12345) @davidwendt
  • Enable max compression ratio small block optimization for ZSTD (#12338) @vuule
  • Fix warnings in test_monotonic.py (#12334) @vyasr
  • Improve JSON column creation performance (list offsets) (#12330) @karthikeyann
  • Upgrade to arrow-10.0.1 (#12327) @galipremsagar
  • Fix warnings in test_orc.py (#12326) @vyasr
  • Fix warnings in test_groupby.py (#12324) @vyasr
  • Fix test_notebooks.sh (#12323) @ajschmidt8
  • Fix transform gtests coded in namespace cudf::test (#12321) @davidwendt
  • Fix check_style.sh script (#12320) @ajschmidt8
  • Rework cudf::strings::from_timestamps to use make_strings_children (#12317) @davidwendt
  • Fix warnings in test_index.py (#12313) @vyasr
  • Fix warnings in test_multiindex.py (#12310) @vyasr
  • CSV, JSON reader to infer integer column with nulls as int64 instead of float64 (#12309) @karthikeyann
  • Fix warnings in test_indexing.py (#12305) @vyasr
  • Fix warnings in test_joining.py (#12304) @vyasr
  • Unpin dask and distributed for development (#12302) @galipremsagar
  • Re-enable sccache for Jenkins builds (#12297) @ajschmidt8
  • Define needs for pr-builder workflow. (#12296) @bdice
  • Forward merge 22.12 into 23.02 (#12294) @vyasr
  • Fix warnings in test_stats.py (#12293) @vyasr
  • Fix table gtests coded in namespace cudf::test (#12292) @davidwendt
  • Change cython for regex calls to use cudf::strings::regex_program (#12289) @davidwendt
  • Improved error reporting when reading multiple JSON files (#12285) @vuule
  • Deprecate Frame.sum_of_squares (#12284) @vyasr
  • Remove deprecated code for 23.02 (#12281) @vyasr
  • Clean up handling of max_page_size_bytes in Parquet writer (#12277) @etseidl
  • Fix replace gtests coded in namespace cudf::test (#12270) @davidwendt
  • Add pandas nullable type support in Index.to_pandas (#12268) @galipremsagar
  • Rework nvtext::detokenize to use indexalator for row indices (#12267) @davidwendt
  • Fix reduction gtests coded in namespace cudf::test (#12257) @davidwendt
  • Remove default parameters from cudf::detail::sort function declarations (#12254) @davidwendt
  • Add duplicated support for Series, DataFrame and Index (#12246) @galipremsagar
  • Replace column/table test utilities with macros (#12242) @PointKernel
  • Rework cudf::strings::pad and zfill to use make_strings_children (#12238) @davidwendt
  • Fix sort gtests coded in namespace cudf::test (#12237) @davidwendt
  • Wrapping concat and file writes in @acquire_spill_lock() (#12232) @madsbk
  • Rename cudf::structs::detail::superimpose_parent_nulls APIs (#12230) @ttnghia
  • Cover parsing to decimal types in read_json tests (#12229) @vuule
  • Spill Statistics (#12223) @madsbk
  • Use CUDF_JNI_ENABLE_PROFILING to conditionally enable profiling support. (#12221) @bdice
  • Clean up of test_spilling.py (#12220) @madsbk
  • Simplify repetitive boolean logic (#12218) @vuule
  • Add Series.hasnans and Index.hasnans (#12214) @galipremsagar
  • Add cudf::strings::udf::replace function (#12210) @davidwendt
  • Adds in new java APIs for appending byte arrays to host columnar data (#12208) @revans2
  • Remove Python dependencies from Java CI. (#12193) @bdice
  • Fix null order in sort-based groupby and improve groupby tests (#12191) @divyegala
  • Move strings children functions from cudf/strings/detail/utilities.cuh to new header (#12185) @davidwendt
  • Clean up existing JNI scalar to column code (#12173) @revans2
  • Remove JIT type names, refactor id_to_type. (#12158) @bdice
  • Update JNI version to 23.02.0-SNAPSHOT (#12129) @pxLi
  • Minor refactor of cpp/src/io/parquet/page_data.cu (#12126) @etseidl
  • Add codespell as a linter (#12097) @benfred
  • Enable specifying exceptions in error macros (#12078) @vyasr
  • Move _label_encoding from Series to Column (#12040) @shwina
  • Add GitHub Actions Workflows (#12002) @ajschmidt8
  • Consolidate dask-cudf groupby_agg calls in one place (#10835) @charlesbluca

- C++
Published by rapids-bot[bot] about 3 years ago

https://github.com/rapidsai/cudf - v23.02.00

🚨 Breaking Changes

  • Pin dask and distributed for release (#12695) @galipremsagar
  • Change ways to access ptr in Buffer (#12587) @galipremsagar
  • Remove column names (#12578) @vuule
  • Default cudf::io::read_json to nested JSON parser (#12544) @vuule
  • Switch engine=cudf to the new JSON reader (#12509) @galipremsagar
  • Add trailing comma support for nested JSON reader (#12448) @karthikeyann
  • Upgrade to arrow-10.0.1 (#12327) @galipremsagar
  • Fail loudly to avoid data corruption with unsupported input in read_orc (#12325) @vuule
  • CSV, JSON reader to infer integer column with nulls as int64 instead of float64 (#12309) @karthikeyann
  • Remove deprecated code for 23.02 (#12281) @vyasr
  • Null element for parsing error in numeric types in JSON, CSV reader (#12272) @karthikeyann
  • Purge non-empty nulls for superimpose_nulls and push_down_nulls (#12239) @ttnghia
  • Rename cudf::structs::detail::superimpose_parent_nulls APIs (#12230) @ttnghia
  • Remove JIT type names, refactor id_to_type. (#12158) @bdice
  • Floor division uses integer division for integral arguments (#12131) @wence-

πŸ› Bug Fixes

  • Fix a mask data corruption in UDF (#12647) @galipremsagar
  • pre-commit: Update isort version to 5.12.0 (#12645) @wence-
  • tests: Skip cuInit tests if cuda-gdb is not found or not working (#12644) @wence-
  • Revert regex program java APIs and tests (#12639) @cindyyuanjiang
  • Fix leaks in ColumnVectorTest (#12625) @jlowe
  • Handle when spillable buffers own each other (#12607) @madsbk
  • Fix incorrect null counts for sliced columns in JCudfSerialization (#12589) @jlowe
  • lists: Transfer dtypes correctly through list.get (#12586) @wence-
  • timedelta: Don't go via float intermediates for floordiv (#12585) @wence-
  • Fixing BUG, get_next_chunk() should use the blocking function device_read() (#12584) @madsbk
  • Make JNI QuoteStyle accessible outside ai.rapids.cudf (#12572) @mythrocks
  • partition_by_hash(): support index (#12554) @madsbk
  • Mixed Join benchmark bug due to wrong conditional column (#12553) @divyegala
  • Update List Lexicographical Comparator (#12538) @divyegala
  • Dynamically read PTX version (#12534) @brandon-b-miller
  • build.sh switch to use RAPIDS magic value (#12525) @robertmaynard
  • Loosen runtime arrow pinning (#12522) @vyasr
  • Enable metadata transfer for complex types in transpose (#12491) @galipremsagar
  • Fix issues with parquet chunked reader (#12488) @nvdbaranec
  • Fix missing metadata transfer in concat for ListColumn (#12487) @galipremsagar
  • Rename libcudf substring source files to slice (#12484) @davidwendt
  • Fix compile issue with arrow 10 (#12465) @ttnghia
  • Fix List offsets bug in mixed type list column in nested JSON reader (#12447) @karthikeyann
  • Fix xfail incompatibilities (#12423) @vyasr
  • Fix bug in Parquet column index encoding (#12404) @etseidl
  • When building Arrow shared look for a shared OpenSSL (#12396) @robertmaynard
  • Fix get_json_object to return empty column on empty input (#12384) @davidwendt
  • Pin arrow 9 in testing dependencies to prevent conda solve issues (#12377) @vyasr
  • Fix reductions any/all return value for empty input (#12374) @davidwendt
  • Fix debug compile errors in parquet.hpp (#12372) @davidwendt
  • Purge non-empty nulls in cudf::make_lists_column (#12370) @ttnghia
  • Use correct memory resource in io::make_column (#12364) @vyasr
  • Add code to detect possible malformed page data in parquet files. (#12360) @nvdbaranec
  • Fail loudly to avoid data corruption with unsupported input in read_orc (#12325) @vuule
  • Fix NumericPairIteratorTest for float values (#12306) @davidwendt
  • Fixes memory allocation in nested JSON tokenizer (#12300) @elstehle
  • Reconstruct dtypes correctly for list aggs of struct columns (#12290) @wence-
  • Fix regex \A and \Z to strictly match string begin/end (#12282) @davidwendt
  • Fix compile issue in json_chunked_reader.cpp (#12280) @ttnghia
  • Change reductions any/all to return valid values for empty input (#12279) @davidwendt
  • Only exclude join keys that are indices from key columns (#12271) @wence-
  • Fix spill to device limit (#12252) @madsbk
  • Correct behaviour of sort in concat for singleton concatenations (#12247) @wence-
  • Purge non-empty nulls for superimpose_nulls and push_down_nulls (#12239) @ttnghia
  • Patch CUB DeviceSegmentedSort and remove workaround (#12234) @davidwendt
  • Fix memory leak in udf_string::assign(&&) function (#12206) @davidwendt
  • Workaround thrust-copy-if limit in json get_tree_representation (#12190) @davidwendt
  • Fix page size calculation in Parquet writer (#12182) @etseidl
  • Add cudf::detail::sizes_to_offsets_iterator to allow checking overflow in offsets (#12180) @davidwendt
  • Workaround thrust-copy-if limit in wordpiece-tokenizer (#12168) @davidwendt
  • Floor division uses integer division for integral arguments (#12131) @wence-

📖 Documentation

  • Fix link to NVTX (#12598) @sameerz
  • Include missing groupby functions in documentation (#12580) @quasiben
  • Fix documentation author (#12527) @bdice
  • Update libcudf reduction docs for casting output types (#12526) @davidwendt
  • Add JSON reader page in user guide (#12499) @GregoryKimball
  • Link unsupported iteration API docstrings (#12482) @galipremsagar
  • strings_udf doc update (#12469) @brandon-b-miller
  • Update cudf_assert docs with correct NDEBUG behavior (#12464) @robertmaynard
  • Update pre-commit hooks guide (#12395) @bdice
  • Update test docs to not use detail comparison utilities (#12332) @PointKernel
  • Fix doxygen description for regex_program::compute_working_memory_size (#12329) @davidwendt
  • Add eval to docs. (#12322) @vyasr
  • Turn on xfail_strict=true (#12244) @wence-
  • Update 10 minutes to cuDF (#12114) @wence-

🚀 New Features

  • Use kvikIO as the default IO backend (#12574) @vuule
  • Use has_nonempty_nulls instead of may_contain_non_empty_nulls in superimpose_nulls and push_down_nulls (#12560) @ttnghia
  • Add strings methods removeprefix and removesuffix (#12557) @davidwendt
  • Add regex_program java APIs and unit tests (#12548) @cindyyuanjiang
  • Default cudf::io::read_json to nested JSON parser (#12544) @vuule
  • Make string quoting optional on CSV write (#12539) @mythrocks
  • Use new nvCOMP API to optimize the compression temp memory size (#12533) @vuule
  • Support "values" orient (array of arrays) in Nested JSON reader (#12498) @karthikeyann
  • one_hot_encode to use experimental row comparators (#12478) @divyegala
  • Support %W and %w format specifiers in cudf::strings::to_timestamps (#12475) @davidwendt
  • Add JSON Writer (#12474) @karthikeyann
  • Refactor thrust_copy_if into cudf::detail::copy_if_safe (#12455) @ttnghia
  • Add trailing comma support for nested JSON reader (#12448) @karthikeyann
  • Extract tokenize_json.hpp detail header from src/io/json/nested_json.hpp (#12432) @ttnghia
  • JNI bindings to write CSV (#12425) @mythrocks
  • Nested JSON depth benchmark (#12371) @karthikeyann
  • Implement lists::reverse (#12336) @ttnghia
  • Use device_read in experimental read_json (#12314) @vuule
  • Implement JNI for strings::reverse (#12283) @ttnghia
  • Null element for parsing error in numeric types in JSON, CSV reader (#12272) @karthikeyann
  • Add cudf::strings::like function with multiple patterns (#12269) @davidwendt
  • Add environment variable to control host memory allocation in hostdevice_vector (#12251) @vuule
  • Add cudf::strings::reverse function (#12227) @davidwendt
  • Selectively use dictionary encoding in Parquet writer (#12211) @etseidl
  • Support replace in strings_udf (#12207) @brandon-b-miller
  • Add support to read binary encoded decimals in parquet (#12205) @PointKernel
  • Support regex EOL where the string ends with a new-line character (#12181) @davidwendt
  • Updating stream_compaction/unique to use new row comparators (#12159) @divyegala
  • Add device buffer datasource (#12024) @PointKernel
  • Implement groupby apply with JIT (#11452) @bwyogatama

🛠️ Improvements

  • Update shared workflow branches (#12696) @ajschmidt8
  • Pin dask and distributed for release (#12695) @galipremsagar
  • Don't upload libcudf-example to Anaconda.org (#12671) @ajschmidt8
  • Pin wheel dependencies to same RAPIDS release (#12659) @sevagh
  • Use CTK 118/cp310 branch of wheel workflows (#12602) @sevagh
  • Change ways to access ptr in Buffer (#12587) @galipremsagar
  • Version a parquet writer xfail (#12579) @galipremsagar
  • Remove column names (#12578) @vuule
  • Parquet reader optimization to address V100 regression. (#12577) @nvdbaranec
  • Add support for category dtypes in CSV reader (#12571) @galipremsagar
  • Remove spill_lock parameter from SpillableBuffer.get_ptr() (#12564) @madsbk
  • Optimize cudf::make_lists_column (#12547) @ttnghia
  • Remove cudf::strings::repeat_strings_output_sizes from Java and JNI (#12546) @ttnghia
  • Test that cuInit is not called when RAPIDS_NO_INITIALIZE is set (#12545) @wence-
  • Rework repeat_strings to use sizes-to-offsets utility (#12543) @davidwendt
  • Replace exclusive_scan with sizes_to_offsets in cudf::lists::sequences (#12541) @davidwendt
  • Rework nvtext::ngrams_tokenize to use sizes-to-offsets utility (#12540) @davidwendt
  • Fix binary-ops gtests coded in namespace cudf::test (#12536) @davidwendt
  • More @acquire_spill_lock() and as_buffer(..., exposed=False) (#12535) @madsbk
  • Guard CUDA runtime APIs with error checking (#12531) @PointKernel
  • Update TODOs from issue 10432. (#12528) @bdice
  • Update rapids-cmake definitions version in GitHub Actions style checks. (#12511) @bdice
  • Switch engine=cudf to the new JSON reader (#12509) @galipremsagar
  • Fix SUM/MEAN aggregation type support. (#12503) @bdice
  • Stop using pandas._testing (#12492) @vyasr
  • Fix ROLLING_TEST gtests coded in namespace cudf::test (#12490) @davidwendt
  • Fix erroneously skipped ORC ZSTD test (#12486) @vuule
  • Rework nvtext::generate_character_ngrams to use make_strings_children (#12480) @davidwendt
  • Raise warnings as errors in the test suite (#12468) @vyasr
  • Remove int32 hard-coding in python (#12467) @galipremsagar
  • Use cudaMemcpyDefault. (#12466) @bdice
  • Update workflows for nightly tests (#12462) @ajschmidt8
  • Build CUDA 11.8 and Python 3.10 Packages (#12457) @ajschmidt8
  • JNI build image default as cuda11.8 (#12441) @pxLi
  • Re-enable Recently Updated Check (#12435) @ajschmidt8
  • Rework remaining cudf::strings::from_xyz functions to use make_strings_children (#12434) @vuule
  • Build wheels alongside conda CI (#12427) @sevagh
  • Remove arguments for checking exception messages in Python (#12424) @vyasr
  • Clean up cuco usage (#12421) @PointKernel
  • Fix warnings in remaining modules (#12406) @vyasr
  • Update ops-bot.yaml (#12402) @ajschmidt8
  • Rework cudf::strings::integers_to_ipv4 to use make_strings_children utility (#12401) @davidwendt
  • Use numpy.empty() instead of bytearray to allocate host memory for spilling (#12399) @madsbk
  • Deprecate chunksize from dask_cudf.read_csv (#12394) @rjzamora
  • Expose the RMM pool size in JNI (#12390) @revans2
  • Fix COPYING_TEST: gtests coded in namespace cudf::test (#12387) @davidwendt
  • Rework cudf::strings::url_encode to use make_strings_children utility (#12385) @davidwendt
  • Use make_strings_children in parse_data nested json reader (#12382) @karthikeyann
  • Fix warnings in test_datetime.py (#12381) @vyasr
  • Mixed Join Benchmarks (#12375) @divyegala
  • Fix warnings in dataframe.py (#12369) @vyasr
  • Update conda recipes. (#12368) @bdice
  • Use gpu-latest-1 runner tag (#12366) @bdice
  • Rework cudf::strings::from_booleans to use make_strings_children (#12365) @vuule
  • Fix warnings in test modules up to test_dataframe.py (#12355) @vyasr
  • JSON column performance optimization - struct column nulls (#12354) @karthikeyann
  • Accelerate stable-segmented-sort with CUB segmented sort (#12347) @davidwendt
  • Add size check to make_offsets_child_column utility (#12345) @davidwendt
  • Enable max compression ratio small block optimization for ZSTD (#12338) @vuule
  • Fix warnings in test_monotonic.py (#12334) @vyasr
  • Improve JSON column creation performance (list offsets) (#12330) @karthikeyann
  • Upgrade to arrow-10.0.1 (#12327) @galipremsagar
  • Fix warnings in test_orc.py (#12326) @vyasr
  • Fix warnings in test_groupby.py (#12324) @vyasr
  • Fix test_notebooks.sh (#12323) @ajschmidt8
  • Fix transform gtests coded in namespace cudf::test (#12321) @davidwendt
  • Fix check_style.sh script (#12320) @ajschmidt8
  • Rework cudf::strings::from_timestamps to use make_strings_children (#12317) @davidwendt
  • Fix warnings in test_index.py (#12313) @vyasr
  • Fix warnings in test_multiindex.py (#12310) @vyasr
  • CSV, JSON reader to infer integer column with nulls as int64 instead of float64 (#12309) @karthikeyann
  • Fix warnings in test_indexing.py (#12305) @vyasr
  • Fix warnings in test_joining.py (#12304) @vyasr
  • Unpin dask and distributed for development (#12302) @galipremsagar
  • Re-enable sccache for Jenkins builds (#12297) @ajschmidt8
  • Define needs for pr-builder workflow. (#12296) @bdice
  • Forward merge 22.12 into 23.02 (#12294) @vyasr
  • Fix warnings in test_stats.py (#12293) @vyasr
  • Fix table gtests coded in namespace cudf::test (#12292) @davidwendt
  • Change cython for regex calls to use cudf::strings::regex_program (#12289) @davidwendt
  • Improved error reporting when reading multiple JSON files (#12285) @vuule
  • Deprecate Frame.sum_of_squares (#12284) @vyasr
  • Remove deprecated code for 23.02 (#12281) @vyasr
  • Clean up handling of max_page_size_bytes in Parquet writer (#12277) @etseidl
  • Fix replace gtests coded in namespace cudf::test (#12270) @davidwendt
  • Add pandas nullable type support in Index.to_pandas (#12268) @galipremsagar
  • Rework nvtext::detokenize to use indexalator for row indices (#12267) @davidwendt
  • Fix reduction gtests coded in namespace cudf::test (#12257) @davidwendt
  • Remove default parameters from cudf::detail::sort function declarations (#12254) @davidwendt
  • Add duplicated support for Series, DataFrame and Index (#12246) @galipremsagar
  • Replace column/table test utilities with macros (#12242) @PointKernel
  • Rework cudf::strings::pad and zfill to use make_strings_children (#12238) @davidwendt
  • Fix sort gtests coded in namespace cudf::test (#12237) @davidwendt
  • Wrapping concat and file writes in @acquire_spill_lock() (#12232) @madsbk
  • Rename cudf::structs::detail::superimpose_parent_nulls APIs (#12230) @ttnghia
  • Cover parsing to decimal types in read_json tests (#12229) @vuule
  • Spill Statistics (#12223) @madsbk
  • Use CUDF_JNI_ENABLE_PROFILING to conditionally enable profiling support. (#12221) @bdice
  • Clean up of test_spilling.py (#12220) @madsbk
  • Simplify repetitive boolean logic (#12218) @vuule
  • Add Series.hasnans and Index.hasnans (#12214) @galipremsagar
  • Add cudf::strings::udf::replace function (#12210) @davidwendt
  • Adds in new java APIs for appending byte arrays to host columnar data (#12208) @revans2
  • Remove Python dependencies from Java CI. (#12193) @bdice
  • Fix null order in sort-based groupby and improve groupby tests (#12191) @divyegala
  • Move strings children functions from cudf/strings/detail/utilities.cuh to new header (#12185) @davidwendt
  • Clean up existing JNI scalar to column code (#12173) @revans2
  • Remove JIT type names, refactor id_to_type. (#12158) @bdice
  • Update JNI version to 23.02.0-SNAPSHOT (#12129) @pxLi
  • Minor refactor of cpp/src/io/parquet/page_data.cu (#12126) @etseidl
  • Add codespell as a linter (#12097) @benfred
  • Enable specifying exceptions in error macros (#12078) @vyasr
  • Move _label_encoding from Series to Column (#12040) @shwina
  • Add GitHub Actions Workflows (#12002) @ajschmidt8
  • Consolidate dask-cudf groupby_agg calls in one place (#10835) @charlesbluca

- C++
Published by raydouglass about 3 years ago

https://github.com/rapidsai/cudf - v22.12.01

🚨 Breaking Changes

  • Add JNI for substring without 'end' parameter. (#12113) @firestarman
  • Refactor purge_nonempty_nulls (#12111) @ttnghia
  • Create an int8 column in read_csv when all elements are missing (#12110) @vuule
  • Throw an error when libcudf is built without cuFile and LIBCUDF_CUFILE_POLICY is set to "ALWAYS" (#12080) @vuule
  • Fix type promotion edge cases in numerical binops (#12074) @wence-
  • Reduce/Remove reliance on **kwargs and *args in IO readers & writers (#12025) @galipremsagar
  • Rollback of DeviceBufferLike (#12009) @madsbk
  • Remove unused managed_allocator (#12005) @vyasr
  • Pass column names to write_csv instead of table_metadata pointer (#11972) @vuule
  • Accept const refs instead of const unique_ptr refs in reduce and scan APIs. (#11960) @vyasr
  • Default to equal NaNs in make_merge_sets_aggregation. (#11952) @bdice
  • Remove validation that requires introspection (#11938) @vyasr
  • Trim quotes for non-string values in nested json parsing (#11898) @karthikeyann
  • Add tests ensuring that cudf's default stream is always used (#11875) @vyasr
  • Support nested types as groupby keys in libcudf (#11792) @PointKernel
  • Default to equal NaNs in make_collect_set_aggregation. (#11621) @bdice
  • Removing int8 column option from parquet byte_array writing (#11539) @hyperbolic2346
  • part1: Simplify BaseIndex to an abstract class (#10389) @skirui-source

πŸ› Bug Fixes

  • strings_udf: use libcudf caching of character tables (#12343) @wence-
  • Fix include line for IO Cython modules (#12250) @vyasr
  • Make dask pinning looser (#12231) @vyasr
  • Workaround for CUB segmented-sort bug with boolean keys (#12217) @davidwendt
  • Fix from_dict backend dispatch to match upstream dask (#12203) @galipremsagar
  • Merge branch-22.10 into branch-22.12 (#12198) @davidwendt
  • Fix compression in ORC writer (#12194) @vuule
  • Don't use CMake 3.25.0 as it has a show stopping FindCUDAToolkit bug (#12188) @robertmaynard
  • Fix data corruption when reading ORC files with empty stripes (#12160) @vuule
  • Fix decimal binary operations (#12142) @galipremsagar
  • Ensure dlpack include is provided to cudf interop lib (#12139) @robertmaynard
  • Safely allocate udf_string pointers in strings_udf (#12138) @brandon-b-miller
  • Fix/disable jitify lto (#12122) @robertmaynard
  • Fix conditional_full_join benchmark (#12121) @GregoryKimball
  • Fix regex working-memory-size refactor error (#12119) @davidwendt
  • Add in negative size checks for columns (#12118) @revans2
  • Add JNI for substring without 'end' parameter. (#12113) @firestarman
  • Fix reading of CSV files with blank second row (#12098) @vuule
  • Fix an error in IO with GzipFile type (#12085) @galipremsagar
  • Workaround groupby aggregate thrust::copy_if overflow (#12079) @davidwendt
  • Fix alignment of compressed blocks in ORC writer (#12077) @vuule
  • Fix singleton-range __setitem__ edge case (#12075) @wence-
  • Fix type promotion edge cases in numerical binops (#12074) @wence-
  • Force using old fmt in nvbench. (#12067) @vyasr
  • Fixes List offset bug in Nested JSON reader (#12060) @karthikeyann
  • Allow falling back to shim_60.ptx by default in strings_udf (#12056) @brandon-b-miller
  • Force black exclusions for pre-commit. (#12036) @bdice
  • Add memory_usage & items implementation for Struct column & dtype (#12033) @galipremsagar
  • Reduce/Remove reliance on **kwargs and *args in IO readers & writers (#12025) @galipremsagar
  • Fixes bug in csv_reader_options construction in cython (#12021) @karthikeyann
  • Fix issues when both usecols and names options are used in read_csv (#12018) @vuule
  • Port thrust's pinned_allocator to cudf, since Thrust 1.17 removes the type (#12004) @robertmaynard
  • Revert "Replace most of preprocessor usage in nvcomp adapter with constexpr" (#11999) @vuule
  • Fix bug where df.loc resulting in single row could give wrong index (#11998) @eriknw
  • Switch to DISABLE_DEPRECATION_WARNINGS to match other RAPIDS projects (#11989) @robertmaynard
  • Fix maximum page size estimate in Parquet writer (#11962) @vuule
  • Fix local offset handling in bgzip reader (#11918) @upsj
  • Fix an issue reading struct-of-list types in Parquet. (#11910) @nvdbaranec
  • Fix memcheck error in TypeInference.Timestamp gtest (#11905) @davidwendt
  • Fix type casting in Series.__setitem__ (#11904) @wence-
  • Fix memcheck error in get_dremel_data (#11903) @davidwendt
  • Fixes Unsupported column type error due to empty list columns in Nested JSON reader (#11897) @karthikeyann
  • Fix segmented-sort to ignore indices outside the offsets (#11888) @davidwendt
  • Fix cudf::stable_sorted_order for NaN and -NaN in FLOAT64 columns (#11874) @davidwendt
  • Fix writing of Parquet files with many fragments (#11869) @etseidl
  • Fix RangeIndex unary operators. (#11868) @vyasr
  • JNI Avoid NPE for reading host binary data (#11865) @revans2
  • Fix decimal benchmark input data generation (#11863) @karthikeyann
  • Fix pre-commit copyright check (#11860) @galipremsagar
  • Fix Parquet support for seconds and milliseconds duration types (#11854) @vuule
  • Ensure better compiler cache results between cudf cal-ver branches (#11835) @robertmaynard
  • Fix make_column_from_scalar for all-null strings column (#11807) @davidwendt
  • Tell jitify_preprocess where to search for libnvrtc (#11787) @robertmaynard
  • add V2 page header support to parquet reader (#11778) @etseidl
  • Parquet reader: bug fix for a num_rows/skip_rows corner case, w/optimization for nested preprocessing (#11752) @nvdbaranec
  • Determine if Arrow has S3 support at runtime in unit test. (#11560) @bdice

📖 Documentation

  • Use rapidsai CODE_OF_CONDUCT.md (#12166) @bdice
  • Add symlinks to notebooks. (#12128) @bdice
  • Add truncate API to python doc pages (#12109) @galipremsagar
  • Update Numba docs links. (#12107) @bdice
  • Remove "Multi-GPU with Dask-cuDF" notebook. (#12095) @bdice
  • Fix link to c++ developer guide from CONTRIBUTING.md (#12084) @brandon-b-miller
  • Add pivot_table and crosstab to docs. (#12014) @bdice
  • Fix doxygen text for cudf::dictionary::encode (#11991) @davidwendt
  • Replace default_stream_value with get_default_stream in docs. (#11985) @vyasr
  • Add dtype docs pages and docstrings for cudf specific dtypes (#11974) @galipremsagar
  • Update Unit Testing in libcudf guidelines to code tests outside the cudf::test namespace (#11959) @davidwendt
  • Rename libcudf++ to libcudf. (#11953) @bdice
  • Fix documentation referring to removed as_gpu_matrix method. (#11937) @bdice
  • Remove "experimental" warning for struct columns in ORC reader and writer (#11880) @vuule
  • Initial draft of policies and guidelines for libcudf usage. (#11853) @vyasr
  • Add clear indication of non-GPU accelerated parameters in read_json docstring (#11825) @GregoryKimball
  • Add developer docs for writing tests (#11199) @vyasr

🚀 New Features

  • Adds an EventHandler to Java MemoryBuffer to be invoked on close (#12125) @abellina
  • Support + in strings_udf (#12117) @brandon-b-miller
  • Support upper and lower in strings_udf (#12099) @brandon-b-miller
  • Add wheel builds (#12096) @vyasr
  • Allow setting malloc heap size in string udfs (#12094) @brandon-b-miller
  • Support strip, lstrip, and rstrip in strings_udf (#12091) @brandon-b-miller
  • Mark nvcomp zstd compression stable (#12059) @jbrennan333
  • Add debug-only onAllocated/onDeallocated to RmmEventHandler (#12054) @abellina
  • Enable building against the libarrow contained in pyarrow (#12034) @vyasr
  • Add strings like jni and native method (#12032) @cindyyuanjiang
  • Cleanup common parsing code in JSON, CSV reader (#12022) @karthikeyann
  • byte_range support for JSON Lines format (#12017) @karthikeyann
  • Minor cleanup of root CMakeLists.txt for better organization (#11988) @robertmaynard
  • Add inplace arithmetic operators to MaskedType (#11987) @brandon-b-miller
  • Implement JNI for chunked Parquet reader (#11961) @ttnghia
  • Add method argument to DataFrame.quantile (#11957) @rjzamora
  • Add gpu memory watermark apis to JNI (#11950) @abellina
  • Adds retryCount to RmmEventHandler.onAllocFailure (#11940) @abellina
  • Enable returning string data from UDFs used through apply (#11933) @brandon-b-miller
  • Switch over to rapids-cmake patches for thrust (#11921) @robertmaynard
  • Add strings udf C++ classes and functions for phase II (#11912) @davidwendt
  • Trim quotes for non-string values in nested json parsing (#11898) @karthikeyann
  • Enable CEC for strings_udf (#11884) @brandon-b-miller
  • ArrowIPCTableWriter writes an empty batch in the case of an empty table. (#11883) @firestarman
  • Implement chunked Parquet reader (#11867) @ttnghia
  • Add read_orc_metadata to libcudf (#11815) @vuule
  • Support nested types as groupby keys in libcudf (#11792) @PointKernel
  • Adding feature Truncate to DataFrame and Series (#11435) @VamsiTallam95

🛠️ Improvements

  • Reduce number of tests marked spilling (#12197) @madsbk
  • Pin dask and distributed for release (#12165) @galipremsagar
  • Don't rely on GNU find in headers_test.sh (#12164) @wence-
  • Update cp.clip call (#12148) @quasiben
  • Enable automatic column projection in groupby().agg (#12124) @rjzamora
  • Refactor purge_nonempty_nulls (#12111) @ttnghia
  • Create an int8 column in read_csv when all elements are missing (#12110) @vuule
  • Spilling to host memory (#12106) @madsbk
  • First pass of pd.read_orc changes in tests (#12103) @galipremsagar
  • Expose engine argument in dask_cudf.read_json (#12101) @rjzamora
  • Remove CUDA 10 compatibility code. (#12088) @bdice
  • Move and update dask nightly install in CI (#12082) @galipremsagar
  • Throw an error when libcudf is built without cuFile and LIBCUDF_CUFILE_POLICY is set to "ALWAYS" (#12080) @vuule
  • Remove macros that inspect the contents of exceptions (#12076) @vyasr
  • Fix ingest_raw_data performance issue in Nested JSON reader due to RVO (#12070) @karthikeyann
  • Remove overflow error during decimal binops (#12063) @galipremsagar
  • Change cudf::detail::tdigest to cudf::tdigest::detail (#12050) @davidwendt
  • Fix quantile gtests coded in namespace cudf::test (#12049) @davidwendt
  • Add support for DataFrame.from_dict, to_dict and Series.to_dict (#12048) @galipremsagar
  • Refactor Parquet reader (#12046) @ttnghia
  • Forward merge 22.10 into 22.12 (#12045) @vyasr
  • Standardize newlines at ends of files. (#12042) @bdice
  • Trim trailing whitespace from all files. (#12041) @bdice
  • Use nosync policy in gather and scatter implementations. (#12038) @bdice
  • Remove smart quotes from all docstrings. (#12035) @bdice
  • Update cuda-python dependency to 11.7.1 (#12030) @galipremsagar
  • Add cython-lint to pre-commit checks. (#12020) @bdice
  • Use pragma once (#12019) @bdice
  • New GHA to add issues/prs to project board (#12016) @jarmak-nv
  • Add DataFrame.pivot_table. (#12015) @bdice
  • Rollback of DeviceBufferLike (#12009) @madsbk
  • Remove default parameters for nvtext::detail functions (#12007) @davidwendt
  • Remove default parameters for cudf::dictionary::detail functions (#12006) @davidwendt
  • Remove unused managed_allocator (#12005) @vyasr
  • Remove default parameters for cudf::strings::detail functions (#12003) @davidwendt
  • Remove unnecessary code from dask-cudf _Frame (#12001) @rjzamora
  • Ignore python docs build artifacts (#12000) @galipremsagar
  • Use rapids-cmake for google benchmark. (#11997) @vyasr
  • Leverage rapids_cython for more automated RPATH handling (#11996) @vyasr
  • Remove stale labeler (#11995) @raydouglass
  • Move protobuf compilation to CMake (#11986) @vyasr
  • Replace most of preprocessor usage in nvcomp adapter with constexpr (#11980) @vuule
  • Add missing noexcepts to column_in_metadata methods (#11973) @vyasr
  • Pass column names to write_csv instead of table_metadata pointer (#11972) @vuule
  • Accelerate libcudf segmented sort with CUB segmented sort (#11969) @davidwendt
  • Feature/remove default streams (#11967) @vyasr
  • Add pool memory resource to libcudf basic example (#11966) @davidwendt
  • Fix some libcudf calls to cudf::detail::gather (#11963) @davidwendt
  • Accept const refs instead of const unique_ptr refs in reduce and scan APIs. (#11960) @vyasr
  • Add deprecation warning for set_allocator. (#11958) @vyasr
  • Fix lists and structs gtests coded in namespace cudf::test (#11956) @davidwendt
  • Add full page indexes to Parquet writer benchmarks (#11955) @etseidl
  • Use gather-based strings factory in cudf::strings::strip (#11954) @davidwendt
  • Default to equal NaNs in make_merge_sets_aggregation. (#11952) @bdice
  • Add strip_delimiters option to read_text (#11946) @upsj
  • Refactor multibyte_split `output_builder` (#11945) @upsj
  • Remove validation that requires introspection (#11938) @vyasr
  • Add .str.find_multiple API (#11928) @galipremsagar
  • Add regex_program class for use with all regex APIs (#11927) @davidwendt
  • Enable backend dispatching for Dask-DataFrame creation (#11920) @rjzamora
  • Performance improvement in JSON Tree traversal (#11919) @karthikeyann
  • Fix some gtests incorrectly coded in namespace cudf::test (part I) (#11917) @davidwendt
  • Refactor pad/zfill functions for reuse with strings udf (#11914) @davidwendt
  • Add nanosecond & microsecond to DatetimeProperties (#11911) @galipremsagar
  • Pin mimesis version in setup.py. (#11906) @bdice
  • Error on ListColumn or any new unsupported column in cudf.Index (#11902) @galipremsagar
  • Add thrust output iterator fix (1805) to thrust.patch (#11900) @davidwendt
  • Relax codecov threshold diff (#11899) @galipremsagar
  • Use public APIs in STREAM_COMPACTION_NVBENCH (#11892) @GregoryKimball
  • Add coverage for string UDF tests. (#11891) @vyasr
  • Provide data_chunk_source wrapper for datasource (#11886) @upsj
  • Handle multibyte_split byte_range out-of-bounds offsets on host (#11885) @upsj
  • Add tests ensuring that cudf's default stream is always used (#11875) @vyasr
  • Change expect_strings_empty into expect_column_empty libcudf test utility (#11873) @davidwendt
  • Add ngroup (#11871) @shwina
  • Reduce memory usage in nested JSON parser - tree generation (#11864) @karthikeyann
  • Unpin dask and distributed for development (#11859) @galipremsagar
  • Remove unused includes for table/row_operators (#11857) @GregoryKimball
  • Use conda-forge's pyorc (#11855) @jakirkham
  • Add libcudf strings examples (#11849) @davidwendt
  • Remove cudf_io namespace alias (#11827) @vuule
  • Test/remove thrust vector usage (#11813) @vyasr
  • Add BGZIP reader to python read_text (#11802) @upsj
  • Merge branch-22.10 into branch-22.12 (#11801) @davidwendt
  • Fix compile warning from CUDF_FUNC_RANGE in a member function (#11798) @davidwendt
  • Update cudf JNI version to 22.12.0-SNAPSHOT (#11764) @pxLi
  • Update flake8 to 5.0.4 and use flake8-force to check Cython. (#11736) @bdice
  • Add BGZIP multibyte_split benchmark (#11723) @upsj
  • Bifurcate Dependency Lists (#11674) @bdice
  • Default to equal NaNs in make_collect_set_aggregation. (#11621) @bdice
  • Conform "bench_isin" to match generator column names (#11549) @GregoryKimball
  • Removing int8 column option from parquet byte_array writing (#11539) @hyperbolic2346
  • Add checks for HLG layers in dask-cudf groupby tests (#10853) @charlesbluca
  • part1: Simplify BaseIndex to an abstract class (#10389) @skirui-source
  • Make all nvcc warnings into errors (#8916) @trxcllnt

- C++
Published by GPUtester about 3 years ago

https://github.com/rapidsai/cudf - v22.12.00

🚨 Breaking Changes

  • Add JNI for substring without 'end' parameter. (#12113) @firestarman
  • Refactor purge_nonempty_nulls (#12111) @ttnghia
  • Create an int8 column in read_csv when all elements are missing (#12110) @vuule
  • Throw an error when libcudf is built without cuFile and LIBCUDF_CUFILE_POLICY is set to "ALWAYS" (#12080) @vuule
  • Fix type promotion edge cases in numerical binops (#12074) @wence-
  • Reduce/Remove reliance on **kwargs and *args in IO readers & writers (#12025) @galipremsagar
  • Rollback of DeviceBufferLike (#12009) @madsbk
  • Remove unused managed_allocator (#12005) @vyasr
  • Pass column names to write_csv instead of table_metadata pointer (#11972) @vuule
  • Accept const refs instead of const unique_ptr refs in reduce and scan APIs. (#11960) @vyasr
  • Default to equal NaNs in make_merge_sets_aggregation. (#11952) @bdice
  • Remove validation that requires introspection (#11938) @vyasr
  • Trim quotes for non-string values in nested json parsing (#11898) @karthikeyann
  • Add tests ensuring that cudf's default stream is always used (#11875) @vyasr
  • Support nested types as groupby keys in libcudf (#11792) @PointKernel
  • Default to equal NaNs in make_collect_set_aggregation. (#11621) @bdice
  • Removing int8 column option from parquet byte_array writing (#11539) @hyperbolic2346
  • part1: Simplify BaseIndex to an abstract class (#10389) @skirui-source

🐛 Bug Fixes

  • Fix include line for IO Cython modules (#12250) @vyasr
  • Make dask pinning looser (#12231) @vyasr
  • Workaround for CUB segmented-sort bug with boolean keys (#12217) @davidwendt
  • Fix from_dict backend dispatch to match upstream dask (#12203) @galipremsagar
  • Merge branch-22.10 into branch-22.12 (#12198) @davidwendt
  • Fix compression in ORC writer (#12194) @vuule
  • Don't use CMake 3.25.0 as it has a show stopping FindCUDAToolkit bug (#12188) @robertmaynard
  • Fix data corruption when reading ORC files with empty stripes (#12160) @vuule
  • Fix decimal binary operations (#12142) @galipremsagar
  • Ensure dlpack include is provided to cudf interop lib (#12139) @robertmaynard
  • Safely allocate udf_string pointers in strings_udf (#12138) @brandon-b-miller
  • Fix/disable jitify lto (#12122) @robertmaynard
  • Fix conditional_full_join benchmark (#12121) @GregoryKimball
  • Fix regex working-memory-size refactor error (#12119) @davidwendt
  • Add in negative size checks for columns (#12118) @revans2
  • Add JNI for substring without 'end' parameter. (#12113) @firestarman
  • Fix reading of CSV files with blank second row (#12098) @vuule
  • Fix an error in IO with GzipFile type (#12085) @galipremsagar
  • Workaround groupby aggregate thrust::copy_if overflow (#12079) @davidwendt
  • Fix alignment of compressed blocks in ORC writer (#12077) @vuule
  • Fix singleton-range __setitem__ edge case (#12075) @wence-
  • Fix type promotion edge cases in numerical binops (#12074) @wence-
  • Force using old fmt in nvbench. (#12067) @vyasr
  • Fixes List offset bug in Nested JSON reader (#12060) @karthikeyann
  • Allow falling back to shim_60.ptx by default in strings_udf (#12056) @brandon-b-miller
  • Force black exclusions for pre-commit. (#12036) @bdice
  • Add memory_usage & items implementation for Struct column & dtype (#12033) @galipremsagar
  • Reduce/Remove reliance on **kwargs and *args in IO readers & writers (#12025) @galipremsagar
  • Fixes bug in csv_reader_options construction in cython (#12021) @karthikeyann
  • Fix issues when both usecols and names options are used in read_csv (#12018) @vuule
  • Port thrust's pinned_allocator to cudf, since Thrust 1.17 removes the type (#12004) @robertmaynard
  • Revert "Replace most of preprocessor usage in nvcomp adapter with constexpr" (#11999) @vuule
  • Fix bug where df.loc resulting in single row could give wrong index (#11998) @eriknw
  • Switch to DISABLE_DEPRECATION_WARNINGS to match other RAPIDS projects (#11989) @robertmaynard
  • Fix maximum page size estimate in Parquet writer (#11962) @vuule
  • Fix local offset handling in bgzip reader (#11918) @upsj
  • Fix an issue reading struct-of-list types in Parquet. (#11910) @nvdbaranec
  • Fix memcheck error in TypeInference.Timestamp gtest (#11905) @davidwendt
  • Fix type casting in Series.__setitem__ (#11904) @wence-
  • Fix memcheck error in get_dremel_data (#11903) @davidwendt
  • Fixes Unsupported column type error due to empty list columns in Nested JSON reader (#11897) @karthikeyann
  • Fix segmented-sort to ignore indices outside the offsets (#11888) @davidwendt
  • Fix cudf::stable_sorted_order for NaN and -NaN in FLOAT64 columns (#11874) @davidwendt
  • Fix writing of Parquet files with many fragments (#11869) @etseidl
  • Fix RangeIndex unary operators. (#11868) @vyasr
  • JNI Avoid NPE for reading host binary data (#11865) @revans2
  • Fix decimal benchmark input data generation (#11863) @karthikeyann
  • Fix pre-commit copyright check (#11860) @galipremsagar
  • Fix Parquet support for seconds and milliseconds duration types (#11854) @vuule
  • Ensure better compiler cache results between cudf cal-ver branches (#11835) @robertmaynard
  • Fix make_column_from_scalar for all-null strings column (#11807) @davidwendt
  • Tell jitify_preprocess where to search for libnvrtc (#11787) @robertmaynard
  • add V2 page header support to parquet reader (#11778) @etseidl
  • Parquet reader: bug fix for a num_rows/skip_rows corner case, w/optimization for nested preprocessing (#11752) @nvdbaranec
  • Determine if Arrow has S3 support at runtime in unit test. (#11560) @bdice

📖 Documentation

  • Use rapidsai CODE_OF_CONDUCT.md (#12166) @bdice
  • Add symlinks to notebooks. (#12128) @bdice
  • Add truncate API to python doc pages (#12109) @galipremsagar
  • Update Numba docs links. (#12107) @bdice
  • Remove "Multi-GPU with Dask-cuDF" notebook. (#12095) @bdice
  • Fix link to c++ developer guide from CONTRIBUTING.md (#12084) @brandon-b-miller
  • Add pivot_table and crosstab to docs. (#12014) @bdice
  • Fix doxygen text for cudf::dictionary::encode (#11991) @davidwendt
  • Replace default_stream_value with get_default_stream in docs. (#11985) @vyasr
  • Add dtype docs pages and docstrings for cudf specific dtypes (#11974) @galipremsagar
  • Update Unit Testing in libcudf guidelines to code tests outside the cudf::test namespace (#11959) @davidwendt
  • Rename libcudf++ to libcudf. (#11953) @bdice
  • Fix documentation referring to removed as_gpu_matrix method. (#11937) @bdice
  • Remove "experimental" warning for struct columns in ORC reader and writer (#11880) @vuule
  • Initial draft of policies and guidelines for libcudf usage. (#11853) @vyasr
  • Add clear indication of non-GPU accelerated parameters in read_json docstring (#11825) @GregoryKimball
  • Add developer docs for writing tests (#11199) @vyasr

🚀 New Features

  • Adds an EventHandler to Java MemoryBuffer to be invoked on close (#12125) @abellina
  • Support + in strings_udf (#12117) @brandon-b-miller
  • Support upper and lower in strings_udf (#12099) @brandon-b-miller
  • Add wheel builds (#12096) @vyasr
  • Allow setting malloc heap size in string udfs (#12094) @brandon-b-miller
  • Support strip, lstrip, and rstrip in strings_udf (#12091) @brandon-b-miller
  • Mark nvcomp zstd compression stable (#12059) @jbrennan333
  • Add debug-only onAllocated/onDeallocated to RmmEventHandler (#12054) @abellina
  • Enable building against the libarrow contained in pyarrow (#12034) @vyasr
  • Add strings like jni and native method (#12032) @cindyyuanjiang
  • Cleanup common parsing code in JSON, CSV reader (#12022) @karthikeyann
  • byte_range support for JSON Lines format (#12017) @karthikeyann
  • Minor cleanup of root CMakeLists.txt for better organization (#11988) @robertmaynard
  • Add inplace arithmetic operators to MaskedType (#11987) @brandon-b-miller
  • Implement JNI for chunked Parquet reader (#11961) @ttnghia
  • Add method argument to DataFrame.quantile (#11957) @rjzamora
  • Add gpu memory watermark apis to JNI (#11950) @abellina
  • Adds retryCount to RmmEventHandler.onAllocFailure (#11940) @abellina
  • Enable returning string data from UDFs used through apply (#11933) @brandon-b-miller
  • Switch over to rapids-cmake patches for thrust (#11921) @robertmaynard
  • Add strings udf C++ classes and functions for phase II (#11912) @davidwendt
  • Trim quotes for non-string values in nested json parsing (#11898) @karthikeyann
  • Enable CEC for strings_udf (#11884) @brandon-b-miller
  • ArrowIPCTableWriter writes an empty batch in the case of an empty table. (#11883) @firestarman
  • Implement chunked Parquet reader (#11867) @ttnghia
  • Add read_orc_metadata to libcudf (#11815) @vuule
  • Support nested types as groupby keys in libcudf (#11792) @PointKernel
  • Adding feature Truncate to DataFrame and Series (#11435) @VamsiTallam95

🛠️ Improvements

  • Reduce number of tests marked spilling (#12197) @madsbk
  • Pin dask and distributed for release (#12165) @galipremsagar
  • Don't rely on GNU find in headers_test.sh (#12164) @wence-
  • Update cp.clip call (#12148) @quasiben
  • Enable automatic column projection in groupby().agg (#12124) @rjzamora
  • Refactor purge_nonempty_nulls (#12111) @ttnghia
  • Create an int8 column in read_csv when all elements are missing (#12110) @vuule
  • Spilling to host memory (#12106) @madsbk
  • First pass of pd.read_orc changes in tests (#12103) @galipremsagar
  • Expose engine argument in dask_cudf.read_json (#12101) @rjzamora
  • Remove CUDA 10 compatibility code. (#12088) @bdice
  • Move and update dask nightly install in CI (#12082) @galipremsagar
  • Throw an error when libcudf is built without cuFile and LIBCUDF_CUFILE_POLICY is set to "ALWAYS" (#12080) @vuule
  • Remove macros that inspect the contents of exceptions (#12076) @vyasr
  • Fix ingest_raw_data performance issue in Nested JSON reader due to RVO (#12070) @karthikeyann
  • Remove overflow error during decimal binops (#12063) @galipremsagar
  • Change cudf::detail::tdigest to cudf::tdigest::detail (#12050) @davidwendt
  • Fix quantile gtests coded in namespace cudf::test (#12049) @davidwendt
  • Add support for DataFrame.from_dict, to_dict and Series.to_dict (#12048) @galipremsagar
  • Refactor Parquet reader (#12046) @ttnghia
  • Forward merge 22.10 into 22.12 (#12045) @vyasr
  • Standardize newlines at ends of files. (#12042) @bdice
  • Trim trailing whitespace from all files. (#12041) @bdice
  • Use nosync policy in gather and scatter implementations. (#12038) @bdice
  • Remove smart quotes from all docstrings. (#12035) @bdice
  • Update cuda-python dependency to 11.7.1 (#12030) @galipremsagar
  • Add cython-lint to pre-commit checks. (#12020) @bdice
  • Use pragma once (#12019) @bdice
  • New GHA to add issues/prs to project board (#12016) @jarmak-nv
  • Add DataFrame.pivot_table. (#12015) @bdice
  • Rollback of DeviceBufferLike (#12009) @madsbk
  • Remove default parameters for nvtext::detail functions (#12007) @davidwendt
  • Remove default parameters for cudf::dictionary::detail functions (#12006) @davidwendt
  • Remove unused managed_allocator (#12005) @vyasr
  • Remove default parameters for cudf::strings::detail functions (#12003) @davidwendt
  • Remove unnecessary code from dask-cudf _Frame (#12001) @rjzamora
  • Ignore python docs build artifacts (#12000) @galipremsagar
  • Use rapids-cmake for google benchmark. (#11997) @vyasr
  • Leverage rapids_cython for more automated RPATH handling (#11996) @vyasr
  • Remove stale labeler (#11995) @raydouglass
  • Move protobuf compilation to CMake (#11986) @vyasr
  • Replace most of preprocessor usage in nvcomp adapter with constexpr (#11980) @vuule
  • Add missing noexcepts to column_in_metadata methods (#11973) @vyasr
  • Pass column names to write_csv instead of table_metadata pointer (#11972) @vuule
  • Accelerate libcudf segmented sort with CUB segmented sort (#11969) @davidwendt
  • Feature/remove default streams (#11967) @vyasr
  • Add pool memory resource to libcudf basic example (#11966) @davidwendt
  • Fix some libcudf calls to cudf::detail::gather (#11963) @davidwendt
  • Accept const refs instead of const unique_ptr refs in reduce and scan APIs. (#11960) @vyasr
  • Add deprecation warning for set_allocator. (#11958) @vyasr
  • Fix lists and structs gtests coded in namespace cudf::test (#11956) @davidwendt
  • Add full page indexes to Parquet writer benchmarks (#11955) @etseidl
  • Use gather-based strings factory in cudf::strings::strip (#11954) @davidwendt
  • Default to equal NaNs in make_merge_sets_aggregation. (#11952) @bdice
  • Add strip_delimiters option to read_text (#11946) @upsj
  • Refactor multibyte_split `output_builder` (#11945) @upsj
  • Remove validation that requires introspection (#11938) @vyasr
  • Add .str.find_multiple API (#11928) @galipremsagar
  • Add regex_program class for use with all regex APIs (#11927) @davidwendt
  • Enable backend dispatching for Dask-DataFrame creation (#11920) @rjzamora
  • Performance improvement in JSON Tree traversal (#11919) @karthikeyann
  • Fix some gtests incorrectly coded in namespace cudf::test (part I) (#11917) @davidwendt
  • Refactor pad/zfill functions for reuse with strings udf (#11914) @davidwendt
  • Add nanosecond & microsecond to DatetimeProperties (#11911) @galipremsagar
  • Pin mimesis version in setup.py. (#11906) @bdice
  • Error on ListColumn or any new unsupported column in cudf.Index (#11902) @galipremsagar
  • Add thrust output iterator fix (1805) to thrust.patch (#11900) @davidwendt
  • Relax codecov threshold diff (#11899) @galipremsagar
  • Use public APIs in STREAM_COMPACTION_NVBENCH (#11892) @GregoryKimball
  • Add coverage for string UDF tests. (#11891) @vyasr
  • Provide data_chunk_source wrapper for datasource (#11886) @upsj
  • Handle multibyte_split byte_range out-of-bounds offsets on host (#11885) @upsj
  • Add tests ensuring that cudf's default stream is always used (#11875) @vyasr
  • Change expect_strings_empty into expect_column_empty libcudf test utility (#11873) @davidwendt
  • Add ngroup (#11871) @shwina
  • Reduce memory usage in nested JSON parser - tree generation (#11864) @karthikeyann
  • Unpin dask and distributed for development (#11859) @galipremsagar
  • Remove unused includes for table/row_operators (#11857) @GregoryKimball
  • Use conda-forge's pyorc (#11855) @jakirkham
  • Add libcudf strings examples (#11849) @davidwendt
  • Remove cudf_io namespace alias (#11827) @vuule
  • Test/remove thrust vector usage (#11813) @vyasr
  • Add BGZIP reader to python read_text (#11802) @upsj
  • Merge branch-22.10 into branch-22.12 (#11801) @davidwendt
  • Fix compile warning from CUDF_FUNC_RANGE in a member function (#11798) @davidwendt
  • Update cudf JNI version to 22.12.0-SNAPSHOT (#11764) @pxLi
  • Update flake8 to 5.0.4 and use flake8-force to check Cython. (#11736) @bdice
  • Add BGZIP multibyte_split benchmark (#11723) @upsj
  • Bifurcate Dependency Lists (#11674) @bdice
  • Default to equal NaNs in make_collect_set_aggregation. (#11621) @bdice
  • Conform "bench_isin" to match generator column names (#11549) @GregoryKimball
  • Removing int8 column option from parquet byte_array writing (#11539) @hyperbolic2346
  • Add checks for HLG layers in dask-cudf groupby tests (#10853) @charlesbluca
  • part1: Simplify BaseIndex to an abstract class (#10389) @skirui-source
  • Make all nvcc warnings into errors (#8916) @trxcllnt

- C++
Published by GPUtester about 3 years ago

https://github.com/rapidsai/cudf - [NIGHTLY] v22.10.00

🔗 Links

🚨 Breaking Changes

  • Disable Zstandard decompression on nvCOMP 2.4 and Pascal GPUs (#11856) @vuule
  • Disable nvCOMP DEFLATE integration (#11811) @vuule
  • Fix return type of Index.isna & Index.notna (#11769) @galipremsagar
  • Remove kwargs in read_csv & to_csv (#11762) @galipremsagar
  • Fix cudf::partition* APIs that do not return offsets for empty output table (#11709) @ttnghia
  • Fix regex negated classes to not automatically include new-lines (#11644) @davidwendt
  • Update zfill to match Python output (#11634) @davidwendt
  • Upgrade pandas to 1.5 (#11617) @galipremsagar
  • Change default value of ordered to False in CategoricalDtype (#11604) @galipremsagar
  • Move cudf::strings::findall_record to cudf::strings::findall (#11575) @davidwendt
  • Adding optional parquet reader schema (#11524) @hyperbolic2346
  • Deprecate skiprows and num_rows in read_orc (#11522) @galipremsagar
  • Remove support for skip_rows / num_rows options in the parquet reader. (#11503) @nvdbaranec
  • Drop support for skiprows and num_rows in cudf.read_parquet (#11480) @galipremsagar
  • Disable Arrow S3 support by default. (#11470) @bdice
  • Convert thrust::optional usages to std::optional (#11455) @robertmaynard
  • Remove unused is_struct trait. (#11450) @bdice
  • Refactor the Buffer class (#11447) @madsbk
  • Return empty dataframe when reading an ORC file using empty columns option (#11446) @vuule
  • Refactor pad_side and strip_type enums into side_type enum (#11438) @davidwendt
  • Remove HASH_SERIAL_MURMUR3 / serial32BitMurmurHash3 (#11383) @bdice
  • Use the new JSON parser when the experimental reader is selected (#11364) @vuule
  • Remove deprecated Series.applymap. (#11031) @bdice
  • Remove deprecated expand parameter from str.findall. (#11030) @bdice

🐛 Bug Fixes

  • Force using old fmt in nvbench. (#12064) @vyasr
  • Update cuda-python dependency to 11.7.1 (#11994) @shwina
  • Fixes bug in temporary decompression space estimation before calling nvcomp (#11879) @abellina
  • Handle ptx file paths during strings_udf import (#11862) @galipremsagar
  • Disable Zstandard decompression on nvCOMP 2.4 and Pascal GPUs (#11856) @vuule
  • Reset strings_udf CEC and solve several related issues (#11846) @brandon-b-miller
  • Fix bug in new shuffle-based groupby implementation (#11836) @rjzamora
  • Fix is_valid checks in Scalar._binaryop (#11818) @wence-
  • Fix operator NotImplemented issue with numpy (#11816) @galipremsagar
  • Disable nvCOMP DEFLATE integration (#11811) @vuule
  • Build strings_udf package with other python packages in nightlies (#11808) @brandon-b-miller
  • Revert problematic shuffle=explicit-comms changes (#11803) @rjzamora
  • Fix regex out-of-bounds write in strided rows logic (#11797) @davidwendt
  • Build cudf locally before building strings_udf conda packages in CI (#11785) @brandon-b-miller
  • Fix an issue in cudf::row_bit_count involving structs and lists at multiple levels. (#11779) @nvdbaranec
  • Fix return type of Index.isna & Index.notna (#11769) @galipremsagar
  • Fix issue with set-item in case of list and struct types (#11760) @galipremsagar
  • Ensure all libcudf APIs run on cudf's default stream (#11759) @vyasr
  • Resolve dask_cudf failures caused by upstream groupby changes (#11755) @rjzamora
  • Fix ORC string sum statistics (#11740) @vuule
  • Add strings_udf package for python 3.9 (#11730) @brandon-b-miller
  • Ensure that all tests launch kernels on cudf's default stream (#11726) @vyasr
  • Don't assume stream is a compile-time constant expression (#11725) @vyasr
  • Fix get_thrust.cmake format at patch command (#11715) @davidwendt
  • Fix cudf::partition* APIs that do not return offsets for empty output table (#11709) @ttnghia
  • Fix cudf::lists::sort_lists for NaN and Infinity values (#11703) @davidwendt
  • Modify ORC reader timestamp parsing to match the apache reader behavior (#11699) @vuule
  • Fix DataFrame.from_arrow to preserve type metadata (#11698) @galipremsagar
  • Fix compile error due to missing header (#11697) @ttnghia
  • Default to Snappy compression in to_orc when using cuDF or Dask (#11690) @vuule
  • Fix an issue related to MultiIndex when group_keys=True (#11689) @galipremsagar
  • Transfer correct dtype to exploded column (#11687) @wence-
  • Ignore protobuf generated files in mypy checks (#11685) @galipremsagar
  • Maintain the index name after .loc (#11677) @shwina
  • Fix issue with extracting nested column data & dtype preservation (#11671) @galipremsagar
  • Ensure that all cudf tests and benchmarks are conda env aware (#11666) @robertmaynard
  • Update to Thrust 1.17.2 to fix cub ODR issues (#11665) @robertmaynard
  • Fix multi-file remote datasource bug (#11655) @rjzamora
  • Fix invalid regex quantifier check to not include alternation (#11654) @davidwendt
  • Fix bug in device_write(): it uses an incorrect size (#11651) @madsbk
  • fixes overflows in benchmarks (#11649) @elstehle
  • Fix regex negated classes to not automatically include new-lines (#11644) @davidwendt
  • Fix compile error in benchmark nested_json.cpp (#11637) @davidwendt
  • Update zfill to match Python output (#11634) @davidwendt
  • Removed converted type for INT32 and INT64 since they do not convert (#11627) @hyperbolic2346
  • Fix host scalars construction of nested types (#11612) @galipremsagar
  • Fix compile warning in nested_json_gpu.cu (#11607) @davidwendt
  • Change default value of ordered to False in CategoricalDtype (#11604) @galipremsagar
  • Preserve order if necessary when deduping categoricals internally (#11597) @brandon-b-miller
  • Add is_timestamp test for leap second (60) (#11594) @davidwendt
  • Fix an issue with to_arrow when column name type is not a string (#11590) @galipremsagar
  • Fix exception in segmented-reduce benchmark (#11588) @davidwendt
  • Fix encode/decode of negative timestamps in ORC reader/writer (#11586) @vuule
  • Correct distribution data type in quantiles benchmark (#11584) @vuule
  • Fix multibyte_split benchmark for host buffers (#11583) @upsj
  • xfail custreamz display test for now (#11567) @shwina
  • Fix JNI for TableWithMeta to use schema_info instead of column_names (#11566) @jlowe
  • Reduce code duplication for dask & distributed nightly/stable installs (#11565) @galipremsagar
  • Fix groupby failures in dask_cudf CI (#11561) @rjzamora
  • Fix for pivot: error when 'values' is a multicharacter string (#11538) @shaswat-indian
  • find_package(cudf) + arrow9 usable with cudf build directory (#11535) @robertmaynard
  • Fixing crash when writing binary nested data in parquet (#11526) @hyperbolic2346
  • Fix for: error when assigning a value to an empty series (#11523) @shaswat-indian
  • Fix invalid results from conditional-left-anti-join in debug build (#11517) @davidwendt
  • Fix cmake error after upgrading to Arrow 9 (#11513) @ttnghia
  • Fix reverse binary operators acting on a host value and cudf.Scalar (#11512) @bdice
  • Update parquet fuzz tests to drop support for skiprows & num_rows (#11505) @galipremsagar
  • Use rapids-cmake 22.10 best practice for RAPIDS.cmake location (#11493) @robertmaynard
  • Handle some zero-sized corner cases in dlpack interop (#11449) @wence-
  • Return empty dataframe when reading an ORC file using empty columns option (#11446) @vuule
  • libcudf c++ example updated to CPM version 0.35.3 (#11417) @robertmaynard
  • Fix regex quantifier check to include capture groups (#11373) @davidwendt
  • Fix read_text when byte_range is aligned with field (#11371) @upsj
  • Fix to_timestamps truncated subsecond calculation (#11367) @davidwendt
  • column: calculate null_count before release()ing the cudf::column (#11365) @wence-

📖 Documentation

  • Update guide-to-udfs notebook (#11861) @brandon-b-miller
  • Update docstring for cudf.read_text (#11799) @GregoryKimball
  • Add doc section for list & struct handling (#11770) @galipremsagar
  • Document that minimum required CMake version is now 3.23.1 (#11751) @robertmaynard
  • Update libcudf documentation build command in DOCUMENTATION.md (#11735) @davidwendt
  • Add docs for use of string data to DataFrame.apply and Series.apply and update guide to UDFs notebook (#11733) @brandon-b-miller
  • Enable more Pydocstyle rules (#11582) @bdice
  • Remove unused cpp/img folder (#11554) @davidwendt
  • Publish C++ developer docs (#11475) @vyasr
  • Fix a misalignment in cudf.get_dummies docstring (#11443) @galipremsagar
  • Update contributing doc to include links to the developer guides (#11390) @davidwendt
  • Fix table_view_base doxygen format (#11340) @davidwendt
  • Create main developer guide for Python (#11235) @vyasr
  • Add developer documentation for benchmarking (#11122) @vyasr
  • cuDF error handling document (#7917) @isVoid

🚀 New Features

  • Add hasNull statistic reading ability to ORC (#11747) @devavret
  • Add istitle to string UDFs (#11738) @brandon-b-miller
  • JSON Column creation in GPU (#11714) @karthikeyann
  • Adds option to take explicit nested schema for nested JSON reader (#11682) @elstehle
  • Add BGZIP data_chunk_reader (#11652) @upsj
  • Support DECIMAL order-by for RANGE window functions (#11645) @mythrocks
  • changing version of cmake to 3.23.3 (#11619) @hyperbolic2346
  • Generate unique keys table in java JNI contiguousSplitGroups (#11614) @res-life
  • Generic type casting to support the new nested JSON reader (#11613) @elstehle
  • JSON tree traversal (#11610) @karthikeyann
  • Add casting operators to masked UDFs (#11578) @brandon-b-miller
  • Adds type inference and type conversion for leaf-columns to the nested JSON parser (#11574) @elstehle
  • Add strings 'like' function (#11558) @davidwendt
  • Handle hyphen as literal for regex cclass when incomplete range (#11557) @davidwendt
  • Enable ZSTD compression in ORC and Parquet writers (#11551) @vuule
  • Adds support for json lines format to the nested JSON reader (#11534) @elstehle
  • Adding optional parquet reader schema (#11524) @hyperbolic2346
  • Adds GPU implementation of JSON-token-stream to JSON-tree (#11518) @karthikeyann
  • Add gdb pretty-printers for simple types (#11499) @upsj
  • Add create_random_column function to the data generator (#11490) @vuule
  • Add fluent API builder to data_profile (#11479) @vuule
  • Adds Nested Json benchmark (#11466) @karthikeyann
  • Convert thrust::optional usages to std::optional (#11455) @robertmaynard
  • Python API for the future experimental JSON reader (#11426) @vuule
  • Return schema info from JSON reader (#11419) @vuule
  • Add regex ASCII flag support for matching builtin character classes (#11404) @davidwendt
  • Truncate parquet column indexes (#11403) @etseidl
  • Adds the end-to-end JSON parser implementation (#11388) @elstehle
  • Use the new JSON parser when the experimental reader is selected (#11364) @vuule
  • Add placeholder for the experimental JSON reader (#11334) @vuule
  • Add read-only functions on string dtypes to DataFrame.apply and Series.apply (#11319) @brandon-b-miller
  • Added 'crosstab' and 'pivot_table' features (#11314) @shaswat-indian
  • Quickly error out when trying to build with unsupported nvcc versions (#11297) @robertmaynard
  • Adds JSON tokenizer (#11264) @elstehle
  • List lexicographic comparator (#11129) @devavret
  • Add generic type inference for cuIO (#11121) @PointKernel
  • Fully support nested types in cudf::contains (#10656) @ttnghia
  • Support nested types in lists::contains (#10548) @ttnghia

🛠️ Improvements

  • Pin dask and distributed for release (#11822) @galipremsagar
  • Add examples for Nested JSON reader (#11814) @GregoryKimball
  • Support shuffle-based groupby aggregations in dask_cudf (#11800) @rjzamora
  • Update strings udf version updater script (#11772) @galipremsagar
  • Remove kwargs in read_csv & to_csv (#11762) @galipremsagar
  • Pass dtype param to avoid pd.Series warnings (#11761) @galipremsagar
  • Enable schema_element & keep_quotes support in json reader (#11746) @galipremsagar
  • Add ability to construct ListColumn when size is None (#11745) @galipremsagar
  • Reduces memory requirements in JSON parser and adds bytes/s and peak memory usage to benchmarks (#11732) @elstehle
  • Add missing copyright headers. (#11712) @bdice
  • Fix copyright check issues in pre-commit (#11711) @bdice
  • Include decimal in supported types for range window order-by columns (#11710) @mythrocks
  • Disable very large column gtest for contiguous-split (#11706) @davidwendt
  • Drop split_out=None test from groupby.agg (#11704) @wence-
  • Use CubinLinker for CUDA Minor Version Compatibility (#11701) @gmarkall
  • Add regex capture-group parameter to auto convert to non-capture groups (#11695) @davidwendt
  • Add a __dataframe__ method to the protocol dataframe object (#11692) @rgommers
  • Special-case multibyte_split for single-byte delimiter (#11681) @upsj
  • Remove isort exclusions (#11680) @bdice
  • Refactor CSV reader benchmarks with nvbench (#11678) @PointKernel
  • Check conda recipe headers with pre-commit (#11669) @bdice
  • Remove redundant style check for clang-format. (#11668) @bdice
  • Add support for group_keys in groupby (#11659) @galipremsagar
  • Fix pandoc pinning. (#11658) @bdice
  • Revert removal of skip_rows / num_rows options from the Parquet reader. (#11657) @nvdbaranec
  • Update git metadata (#11647) @bdice
  • Call set_null_count on a returning column if null-count is known (#11646) @davidwendt
  • Fix some libcudf detail calls not passing the stream variable (#11642) @davidwendt
  • Update to mypy 0.971 (#11640) @wence-
  • Refactor strings strip functor to details header (#11635) @davidwendt
  • Fix incorrect nullCount in get_json_object (#11633) @trxcllnt
  • Simplify hostdevice_vector (#11631) @upsj
  • Refactor parquet writer benchmarks with nvbench (#11623) @PointKernel
  • Rework contains_scalar to check nulls at runtime (#11622) @davidwendt
  • Fix incorrect memory resource used in rolling temp columns (#11618) @mythrocks
  • Upgrade pandas to 1.5 (#11617) @galipremsagar
  • Move type-dispatcher calls from traits.hpp to traits.cpp (#11616) @davidwendt
  • Refactor parquet reader benchmarks with nvbench (#11611) @PointKernel
  • Forward-merge branch-22.08 to branch-22.10 (#11608) @bdice
  • Use stream in Java API. (#11601) @bdice
  • Refactors of public/detail APIs, CUDF_FUNC_RANGE, stream handling. (#11600) @bdice
  • Improve ORC writer benchmark with nvbench (#11598) @PointKernel
  • Tune multibyte_split kernel (#11587) @upsj
  • Move split_utils.cuh to strings/detail (#11585) @davidwendt
  • Fix warnings due to compiler regression with if constexpr (#11581) @ttnghia
  • Add full 24-bit dictionary support to Parquet writer (#11580) @etseidl
  • Expose "explicit-comms" option in shuffle-based dask_cudf functions (#11576) @rjzamora
  • Move cudf::strings::findall_record to cudf::strings::findall (#11575) @davidwendt
  • Refactor dask_cudf groupby to use apply_concat_apply (#11571) @rjzamora
  • Add ability to write list(struct) columns as map type in orc writer (#11568) @galipremsagar
  • Add byte_range to multibyte_split benchmark + NVBench refactor (#11562) @upsj
  • JNI support for writing binary columns in parquet (#11556) @revans2
  • Support additional dictionary bit widths in Parquet writer (#11547) @etseidl
  • Refactor string/numeric conversion utilities (#11545) @davidwendt
  • Removing unnecessary asserts in parquet tests (#11544) @hyperbolic2346
  • Clean up ORC reader benchmarks with NVBench (#11543) @PointKernel
  • Reuse MurmurHash3_32 in Parquet page data. (#11528) @bdice
  • Add hexadecimal value separators (#11527) @bdice
  • Deprecate skiprows and num_rows in read_orc (#11522) @galipremsagar
  • Struct support for NULL_EQUALS binary operation (#11520) @rwlee
  • Bump hadoop-common from 3.2.3 to 3.2.4 in /java (#11516) @dependabot[bot]
  • Fix Feather test warning. (#11511) @bdice
  • copy_range ballot_syncs to have no execution dependency (#11508) @robertmaynard
  • Upgrade to arrow-9.x (#11507) @galipremsagar
  • Remove support for skip_rows / num_rows options in the parquet reader. (#11503) @nvdbaranec
  • Single-pass multibyte_split (#11500) @upsj
  • Sanitize percentile_approx() output for empty input (#11498) @SrikarVanavasam
  • Unpin dask and distributed for development (#11492) @galipremsagar
  • Move SparkMurmurHash3_32 functor. (#11489) @bdice
  • Refactor group_nunique.cu to use nullate::DYNAMIC for reduce-by-key functor (#11482) @davidwendt
  • Drop support for skiprows and num_rows in cudf.read_parquet (#11480) @galipremsagar
  • Add reduction distinct_count benchmark (#11473) @ttnghia
  • Add groupby nunique aggregation benchmark (#11472) @ttnghia
  • Disable Arrow S3 support by default. (#11470) @bdice
  • Add groupby max aggregation benchmark (#11464) @ttnghia
  • Extract Dremel encoding code from Parquet (#11461) @vyasr
  • Add missing Thrust #includes. (#11457) @bdice
  • Make CMake hooks verbose (#11456) @vyasr
  • Control Parquet page size through Python API (#11454) @etseidl
  • Add control of Parquet column index creation to python (#11453) @etseidl
  • Remove unused is_struct trait. (#11450) @bdice
  • Refactor the Buffer class (#11447) @madsbk
  • Refactor pad_side and strip_type enums into side_type enum (#11438) @davidwendt
  • Update to Thrust 1.17.0 (#11437) @bdice
  • Add in JNI for parsing JSON data and getting the metadata back too. (#11431) @revans2
  • Convert byte_array_view to use std::byte (#11424) @hyperbolic2346
  • Deprecate unflatten_nested_columns (#11421) @SrikarVanavasam
  • Remove HASH_SERIAL_MURMUR3 / serial32BitMurmurHash3 (#11383) @bdice
  • Add Spark list hashing Java tests (#11379) @bdice
  • Move cmake to the build section. (#11376) @vyasr
  • Remove use of CUDA driver API calls from libcudf (#11370) @shwina
  • Add column constructor from device_uvector&& (#11356) @SrikarVanavasam
  • Remove unused custreamz thirdparty directory (#11343) @vyasr
  • Update jni version to 22.10.0-SNAPSHOT (#11338) @pxLi
  • Enable using upstream jitify2 (#11287) @shwina
  • Cache cudf.Scalar (#11246) @shwina
  • Remove deprecated Series.applymap. (#11031) @bdice
  • Remove deprecated expand parameter from str.findall. (#11030) @bdice

- C++
Published by rapids-bot[bot] over 3 years ago

https://github.com/rapidsai/cudf - v22.10.01

🚨 Breaking Changes

  • Disable Zstandard decompression on nvCOMP 2.4 and Pascal GPUs (#11856) @vuule
  • Disable nvCOMP DEFLATE integration (#11811) @vuule
  • Fix return type of Index.isna & Index.notna (#11769) @galipremsagar
  • Remove kwargs in read_csv & to_csv (#11762) @galipremsagar
  • Fix cudf::partition* APIs that do not return offsets for empty output table (#11709) @ttnghia
  • Fix regex negated classes to not automatically include new-lines (#11644) @davidwendt
  • Update zfill to match Python output (#11634) @davidwendt
  • Upgrade pandas to 1.5 (#11617) @galipremsagar
  • Change default value of ordered to False in CategoricalDtype (#11604) @galipremsagar
  • Move cudf::strings::findall_record to cudf::strings::findall (#11575) @davidwendt
  • Adding optional parquet reader schema (#11524) @hyperbolic2346
  • Deprecate skiprows and num_rows in read_orc (#11522) @galipremsagar
  • Remove support for skip_rows / num_rows options in the parquet reader. (#11503) @nvdbaranec
  • Drop support for skiprows and num_rows in cudf.read_parquet (#11480) @galipremsagar
  • Disable Arrow S3 support by default. (#11470) @bdice
  • Convert thrust::optional usages to std::optional (#11455) @robertmaynard
  • Remove unused is_struct trait. (#11450) @bdice
  • Refactor the Buffer class (#11447) @madsbk
  • Return empty dataframe when reading an ORC file using empty columns option (#11446) @vuule
  • Refactor pad_side and strip_type enums into side_type enum (#11438) @davidwendt
  • Remove HASH_SERIAL_MURMUR3 / serial32BitMurmurHash3 (#11383) @bdice
  • Use the new JSON parser when the experimental reader is selected (#11364) @vuule
  • Remove deprecated Series.applymap. (#11031) @bdice
  • Remove deprecated expand parameter from str.findall. (#11030) @bdice

🐛 Bug Fixes

  • Update cuda-python dependency to 11.7.1 (#11994) @shwina
  • Fixes bug in temporary decompression space estimation before calling nvcomp (#11879) @abellina
  • Handle ptx file paths during strings_udf import (#11862) @galipremsagar
  • Disable Zstandard decompression on nvCOMP 2.4 and Pascal GPUs (#11856) @vuule
  • Reset strings_udf CEC and solve several related issues (#11846) @brandon-b-miller
  • Fix bug in new shuffle-based groupby implementation (#11836) @rjzamora
  • Fix is_valid checks in Scalar._binaryop (#11818) @wence-
  • Fix operator NotImplemented issue with numpy (#11816) @galipremsagar
  • Disable nvCOMP DEFLATE integration (#11811) @vuule
  • Build strings_udf package with other python packages in nightlies (#11808) @brandon-b-miller
  • Revert problematic shuffle=explicit-comms changes (#11803) @rjzamora
  • Fix regex out-of-bounds write in strided rows logic (#11797) @davidwendt
  • Build cudf locally before building strings_udf conda packages in CI (#11785) @brandon-b-miller
  • Fix an issue in cudf::row_bit_count involving structs and lists at multiple levels. (#11779) @nvdbaranec
  • Fix return type of Index.isna & Index.notna (#11769) @galipremsagar
  • Fix issue with set-item in case of list and struct types (#11760) @galipremsagar
  • Ensure all libcudf APIs run on cudf's default stream (#11759) @vyasr
  • Resolve dask_cudf failures caused by upstream groupby changes (#11755) @rjzamora
  • Fix ORC string sum statistics (#11740) @vuule
  • Add strings_udf package for python 3.9 (#11730) @brandon-b-miller
  • Ensure that all tests launch kernels on cudf's default stream (#11726) @vyasr
  • Don't assume stream is a compile-time constant expression (#11725) @vyasr
  • Fix get_thrust.cmake format at patch command (#11715) @davidwendt
  • Fix cudf::partition* APIs that do not return offsets for empty output table (#11709) @ttnghia
  • Fix cudf::lists::sort_lists for NaN and Infinity values (#11703) @davidwendt
  • Modify ORC reader timestamp parsing to match the apache reader behavior (#11699) @vuule
  • Fix DataFrame.from_arrow to preserve type metadata (#11698) @galipremsagar
  • Fix compile error due to missing header (#11697) @ttnghia
  • Default to Snappy compression in to_orc when using cuDF or Dask (#11690) @vuule
  • Fix an issue related to MultiIndex when group_keys=True (#11689) @galipremsagar
  • Transfer correct dtype to exploded column (#11687) @wence-
  • Ignore protobuf generated files in mypy checks (#11685) @galipremsagar
  • Maintain the index name after .loc (#11677) @shwina
  • Fix issue with extracting nested column data & dtype preservation (#11671) @galipremsagar
  • Ensure that all cudf tests and benchmarks are conda env aware (#11666) @robertmaynard
  • Update to Thrust 1.17.2 to fix cub ODR issues (#11665) @robertmaynard
  • Fix multi-file remote datasource bug (#11655) @rjzamora
  • Fix invalid regex quantifier check to not include alternation (#11654) @davidwendt
  • Fix bug in device_write(): it uses an incorrect size (#11651) @madsbk
  • fixes overflows in benchmarks (#11649) @elstehle
  • Fix regex negated classes to not automatically include new-lines (#11644) @davidwendt
  • Fix compile error in benchmark nested_json.cpp (#11637) @davidwendt
  • Update zfill to match Python output (#11634) @davidwendt
  • Removed converted type for INT32 and INT64 since they do not convert (#11627) @hyperbolic2346
  • Fix host scalars construction of nested types (#11612) @galipremsagar
  • Fix compile warning in nested_json_gpu.cu (#11607) @davidwendt
  • Change default value of ordered to False in CategoricalDtype (#11604) @galipremsagar
  • Preserve order if necessary when deduping categoricals internally (#11597) @brandon-b-miller
  • Add is_timestamp test for leap second (60) (#11594) @davidwendt
  • Fix an issue with to_arrow when column name type is not a string (#11590) @galipremsagar
  • Fix exception in segmented-reduce benchmark (#11588) @davidwendt
  • Fix encode/decode of negative timestamps in ORC reader/writer (#11586) @vuule
  • Correct distribution data type in quantiles benchmark (#11584) @vuule
  • Fix multibyte_split benchmark for host buffers (#11583) @upsj
  • xfail custreamz display test for now (#11567) @shwina
  • Fix JNI for TableWithMeta to use schema_info instead of column_names (#11566) @jlowe
  • Reduce code duplication for dask & distributed nightly/stable installs (#11565) @galipremsagar
  • Fix groupby failures in dask_cudf CI (#11561) @rjzamora
  • Fix for pivot: error when 'values' is a multicharacter string (#11538) @shaswat-indian
  • find_package(cudf) + arrow9 usable with cudf build directory (#11535) @robertmaynard
  • Fixing crash when writing binary nested data in parquet (#11526) @hyperbolic2346
  • Fix for: error when assigning a value to an empty series (#11523) @shaswat-indian
  • Fix invalid results from conditional-left-anti-join in debug build (#11517) @davidwendt
  • Fix cmake error after upgrading to Arrow 9 (#11513) @ttnghia
  • Fix reverse binary operators acting on a host value and cudf.Scalar (#11512) @bdice
  • Update parquet fuzz tests to drop support for skiprows & num_rows (#11505) @galipremsagar
  • Use rapids-cmake 22.10 best practice for RAPIDS.cmake location (#11493) @robertmaynard
  • Handle some zero-sized corner cases in dlpack interop (#11449) @wence-
  • Return empty dataframe when reading an ORC file using empty columns option (#11446) @vuule
  • libcudf c++ example updated to CPM version 0.35.3 (#11417) @robertmaynard
  • Fix regex quantifier check to include capture groups (#11373) @davidwendt
  • Fix read_text when byte_range is aligned with field (#11371) @upsj
  • Fix to_timestamps truncated subsecond calculation (#11367) @davidwendt
  • column: calculate null_count before release()ing the cudf::column (#11365) @wence-

📖 Documentation

  • Update guide-to-udfs notebook (#11861) @brandon-b-miller
  • Update docstring for cudf.read_text (#11799) @GregoryKimball
  • Add doc section for list & struct handling (#11770) @galipremsagar
  • Document that minimum required CMake version is now 3.23.1 (#11751) @robertmaynard
  • Update libcudf documentation build command in DOCUMENTATION.md (#11735) @davidwendt
  • Add docs for use of string data to DataFrame.apply and Series.apply and update guide to UDFs notebook (#11733) @brandon-b-miller
  • Enable more Pydocstyle rules (#11582) @bdice
  • Remove unused cpp/img folder (#11554) @davidwendt
  • Publish C++ developer docs (#11475) @vyasr
  • Fix a misalignment in cudf.get_dummies docstring (#11443) @galipremsagar
  • Update contributing doc to include links to the developer guides (#11390) @davidwendt
  • Fix table_view_base doxygen format (#11340) @davidwendt
  • Create main developer guide for Python (#11235) @vyasr
  • Add developer documentation for benchmarking (#11122) @vyasr
  • cuDF error handling document (#7917) @isVoid

🚀 New Features

  • Add hasNull statistic reading ability to ORC (#11747) @devavret
  • Add istitle to string UDFs (#11738) @brandon-b-miller
  • JSON Column creation in GPU (#11714) @karthikeyann
  • Adds option to take explicit nested schema for nested JSON reader (#11682) @elstehle
  • Add BGZIP data_chunk_reader (#11652) @upsj
  • Support DECIMAL order-by for RANGE window functions (#11645) @mythrocks
  • changing version of cmake to 3.23.3 (#11619) @hyperbolic2346
  • Generate unique keys table in java JNI contiguousSplitGroups (#11614) @res-life
  • Generic type casting to support the new nested JSON reader (#11613) @elstehle
  • JSON tree traversal (#11610) @karthikeyann
  • Add casting operators to masked UDFs (#11578) @brandon-b-miller
  • Adds type inference and type conversion for leaf-columns to the nested JSON parser (#11574) @elstehle
  • Add strings 'like' function (#11558) @davidwendt
  • Handle hyphen as literal for regex cclass when incomplete range (#11557) @davidwendt
  • Enable ZSTD compression in ORC and Parquet writers (#11551) @vuule
  • Adds support for json lines format to the nested JSON reader (#11534) @elstehle
  • Adding optional parquet reader schema (#11524) @hyperbolic2346
  • Adds GPU implementation of JSON-token-stream to JSON-tree (#11518) @karthikeyann
  • Add gdb pretty-printers for simple types (#11499) @upsj
  • Add create_random_column function to the data generator (#11490) @vuule
  • Add fluent API builder to data_profile (#11479) @vuule
  • Adds Nested Json benchmark (#11466) @karthikeyann
  • Convert thrust::optional usages to std::optional (#11455) @robertmaynard
  • Python API for the future experimental JSON reader (#11426) @vuule
  • Return schema info from JSON reader (#11419) @vuule
  • Add regex ASCII flag support for matching builtin character classes (#11404) @davidwendt
  • Truncate parquet column indexes (#11403) @etseidl
  • Adds the end-to-end JSON parser implementation (#11388) @elstehle
  • Use the new JSON parser when the experimental reader is selected (#11364) @vuule
  • Add placeholder for the experimental JSON reader (#11334) @vuule
  • Add read-only functions on string dtypes to DataFrame.apply and Series.apply (#11319) @brandon-b-miller
  • Added 'crosstab' and 'pivot_table' features (#11314) @shaswat-indian
  • Quickly error out when trying to build with unsupported nvcc versions (#11297) @robertmaynard
  • Adds JSON tokenizer (#11264) @elstehle
  • List lexicographic comparator (#11129) @devavret
  • Add generic type inference for cuIO (#11121) @PointKernel
  • Fully support nested types in cudf::contains (#10656) @ttnghia
  • Support nested types in lists::contains (#10548) @ttnghia

🛠️ Improvements

  • Pin dask and distributed for release (#11822) @galipremsagar
  • Add examples for Nested JSON reader (#11814) @GregoryKimball
  • Support shuffle-based groupby aggregations in dask_cudf (#11800) @rjzamora
  • Update strings udf version updater script (#11772) @galipremsagar
  • Remove kwargs in read_csv & to_csv (#11762) @galipremsagar
  • Pass dtype param to avoid pd.Series warnings (#11761) @galipremsagar
  • Enable schema_element & keep_quotes support in json reader (#11746) @galipremsagar
  • Add ability to construct ListColumn when size is None (#11745) @galipremsagar
  • Reduces memory requirements in JSON parser and adds bytes/s and peak memory usage to benchmarks (#11732) @elstehle
  • Add missing copyright headers. (#11712) @bdice
  • Fix copyright check issues in pre-commit (#11711) @bdice
  • Include decimal in supported types for range window order-by columns (#11710) @mythrocks
  • Disable very large column gtest for contiguous-split (#11706) @davidwendt
  • Drop split_out=None test from groupby.agg (#11704) @wence-
  • Use CubinLinker for CUDA Minor Version Compatibility (#11701) @gmarkall
  • Add regex capture-group parameter to auto convert to non-capture groups (#11695) @davidwendt
  • Add a __dataframe__ method to the protocol dataframe object (#11692) @rgommers
  • Special-case multibyte_split for single-byte delimiter (#11681) @upsj
  • Remove isort exclusions (#11680) @bdice
  • Refactor CSV reader benchmarks with nvbench (#11678) @PointKernel
  • Check conda recipe headers with pre-commit (#11669) @bdice
  • Remove redundant style check for clang-format. (#11668) @bdice
  • Add support for group_keys in groupby (#11659) @galipremsagar
  • Fix pandoc pinning. (#11658) @bdice
  • Revert removal of skip_rows / num_rows options from the Parquet reader. (#11657) @nvdbaranec
  • Update git metadata (#11647) @bdice
  • Call set_null_count on a returning column if null-count is known (#11646) @davidwendt
  • Fix some libcudf detail calls not passing the stream variable (#11642) @davidwendt
  • Update to mypy 0.971 (#11640) @wence-
  • Refactor strings strip functor to details header (#11635) @davidwendt
  • Fix incorrect nullCount in get_json_object (#11633) @trxcllnt
  • Simplify hostdevice_vector (#11631) @upsj
  • Refactor parquet writer benchmarks with nvbench (#11623) @PointKernel
  • Rework contains_scalar to check nulls at runtime (#11622) @davidwendt
  • Fix incorrect memory resource used in rolling temp columns (#11618) @mythrocks
  • Upgrade pandas to 1.5 (#11617) @galipremsagar
  • Move type-dispatcher calls from traits.hpp to traits.cpp (#11616) @davidwendt
  • Refactor parquet reader benchmarks with nvbench (#11611) @PointKernel
  • Forward-merge branch-22.08 to branch-22.10 (#11608) @bdice
  • Use stream in Java API. (#11601) @bdice
  • Refactors of public/detail APIs, CUDF_FUNC_RANGE, stream handling. (#11600) @bdice
  • Improve ORC writer benchmark with nvbench (#11598) @PointKernel
  • Tune multibyte_split kernel (#11587) @upsj
  • Move split_utils.cuh to strings/detail (#11585) @davidwendt
  • Fix warnings due to compiler regression with if constexpr (#11581) @ttnghia
  • Add full 24-bit dictionary support to Parquet writer (#11580) @etseidl
  • Expose "explicit-comms" option in shuffle-based dask_cudf functions (#11576) @rjzamora
  • Move cudf::strings::findall_record to cudf::strings::findall (#11575) @davidwendt
  • Refactor dask_cudf groupby to use apply_concat_apply (#11571) @rjzamora
  • Add ability to write list(struct) columns as map type in orc writer (#11568) @galipremsagar
  • Add byte_range to multibyte_split benchmark + NVBench refactor (#11562) @upsj
  • JNI support for writing binary columns in parquet (#11556) @revans2
  • Support additional dictionary bit widths in Parquet writer (#11547) @etseidl
  • Refactor string/numeric conversion utilities (#11545) @davidwendt
  • Removing unnecessary asserts in parquet tests (#11544) @hyperbolic2346
  • Clean up ORC reader benchmarks with NVBench (#11543) @PointKernel
  • Reuse MurmurHash3_32 in Parquet page data. (#11528) @bdice
  • Add hexadecimal value separators (#11527) @bdice
  • Deprecate skiprows and num_rows in read_orc (#11522) @galipremsagar
  • Struct support for NULL_EQUALS binary operation (#11520) @rwlee
  • Bump hadoop-common from 3.2.3 to 3.2.4 in /java (#11516) @dependabot[bot]
  • Fix Feather test warning. (#11511) @bdice
  • copy_range ballot_syncs to have no execution dependency (#11508) @robertmaynard
  • Upgrade to arrow-9.x (#11507) @galipremsagar
  • Remove support for skip_rows / num_rows options in the parquet reader. (#11503) @nvdbaranec
  • Single-pass multibyte_split (#11500) @upsj
  • Sanitize percentile_approx() output for empty input (#11498) @SrikarVanavasam
  • Unpin dask and distributed for development (#11492) @galipremsagar
  • Move SparkMurmurHash3_32 functor. (#11489) @bdice
  • Refactor group_nunique.cu to use nullate::DYNAMIC for reduce-by-key functor (#11482) @davidwendt
  • Drop support for skiprows and num_rows in cudf.read_parquet (#11480) @galipremsagar
  • Add reduction distinct_count benchmark (#11473) @ttnghia
  • Add groupby nunique aggregation benchmark (#11472) @ttnghia
  • Disable Arrow S3 support by default. (#11470) @bdice
  • Add groupby max aggregation benchmark (#11464) @ttnghia
  • Extract Dremel encoding code from Parquet (#11461) @vyasr
  • Add missing Thrust #includes. (#11457) @bdice
  • Make CMake hooks verbose (#11456) @vyasr
  • Control Parquet page size through Python API (#11454) @etseidl
  • Add control of Parquet column index creation to python (#11453) @etseidl
  • Remove unused is_struct trait. (#11450) @bdice
  • Refactor the Buffer class (#11447) @madsbk
  • Refactor pad_side and strip_type enums into side_type enum (#11438) @davidwendt
  • Update to Thrust 1.17.0 (#11437) @bdice
  • Add in JNI for parsing JSON data and getting the metadata back too. (#11431) @revans2
  • Convert byte_array_view to use std::byte (#11424) @hyperbolic2346
  • Deprecate unflatten_nested_columns (#11421) @SrikarVanavasam
  • Remove HASH_SERIAL_MURMUR3 / serial32BitMurmurHash3 (#11383) @bdice
  • Add Spark list hashing Java tests (#11379) @bdice
  • Move cmake to the build section. (#11376) @vyasr
  • Remove use of CUDA driver API calls from libcudf (#11370) @shwina
  • Add column constructor from device_uvector&& (#11356) @SrikarVanavasam
  • Remove unused custreamz thirdparty directory (#11343) @vyasr
  • Update jni version to 22.10.0-SNAPSHOT (#11338) @pxLi
  • Enable using upstream jitify2 (#11287) @shwina
  • Cache cudf.Scalar (#11246) @shwina
  • Remove deprecated Series.applymap. (#11031) @bdice
  • Remove deprecated expand parameter from str.findall. (#11030) @bdice

- C++
Published by GPUtester over 3 years ago

https://github.com/rapidsai/cudf - v22.10.00

🚨 Breaking Changes

  • Disable Zstandard decompression on nvCOMP 2.4 and Pascal GPUs (#11856) @vuule
  • Disable nvCOMP DEFLATE integration (#11811) @vuule
  • Fix return type of Index.isna & Index.notna (#11769) @galipremsagar
  • Remove kwargs in read_csv & to_csv (#11762) @galipremsagar
  • Fix cudf::partition* APIs that do not return offsets for empty output table (#11709) @ttnghia
  • Fix regex negated classes to not automatically include new-lines (#11644) @davidwendt
  • Update zfill to match Python output (#11634) @davidwendt
  • Upgrade pandas to 1.5 (#11617) @galipremsagar
  • Change default value of ordered to False in CategoricalDtype (#11604) @galipremsagar
  • Move cudf::strings::findall_record to cudf::strings::findall (#11575) @davidwendt
  • Adding optional parquet reader schema (#11524) @hyperbolic2346
  • Deprecate skiprows and num_rows in read_orc (#11522) @galipremsagar
  • Remove support for skip_rows / num_rows options in the parquet reader. (#11503) @nvdbaranec
  • Drop support for skiprows and num_rows in cudf.read_parquet (#11480) @galipremsagar
  • Disable Arrow S3 support by default. (#11470) @bdice
  • Convert thrust::optional usages to std::optional (#11455) @robertmaynard
  • Remove unused is_struct trait. (#11450) @bdice
  • Refactor the Buffer class (#11447) @madsbk
  • Return empty dataframe when reading an ORC file using empty columns option (#11446) @vuule
  • Refactor pad_side and strip_type enums into side_type enum (#11438) @davidwendt
  • Remove HASH_SERIAL_MURMUR3 / serial32BitMurmurHash3 (#11383) @bdice
  • Use the new JSON parser when the experimental reader is selected (#11364) @vuule
  • Remove deprecated Series.applymap. (#11031) @bdice
  • Remove deprecated expand parameter from str.findall. (#11030) @bdice

🐛 Bug Fixes

  • Fixes bug in temporary decompression space estimation before calling nvcomp (#11879) @abellina
  • Handle ptx file paths during strings_udf import (#11862) @galipremsagar
  • Disable Zstandard decompression on nvCOMP 2.4 and Pascal GPUs (#11856) @vuule
  • Reset strings_udf CEC and solve several related issues (#11846) @brandon-b-miller
  • Fix bug in new shuffle-based groupby implementation (#11836) @rjzamora
  • Fix is_valid checks in Scalar._binaryop (#11818) @wence-
  • Fix operator NotImplemented issue with numpy (#11816) @galipremsagar
  • Disable nvCOMP DEFLATE integration (#11811) @vuule
  • Build strings_udf package with other python packages in nightlies (#11808) @brandon-b-miller
  • Revert problematic shuffle=explicit-comms changes (#11803) @rjzamora
  • Fix regex out-of-bounds write in strided rows logic (#11797) @davidwendt
  • Build cudf locally before building strings_udf conda packages in CI (#11785) @brandon-b-miller
  • Fix an issue in cudf::row_bit_count involving structs and lists at multiple levels. (#11779) @nvdbaranec
  • Fix return type of Index.isna & Index.notna (#11769) @galipremsagar
  • Fix issue with set-item in case of list and struct types (#11760) @galipremsagar
  • Ensure all libcudf APIs run on cudf's default stream (#11759) @vyasr
  • Resolve dask_cudf failures caused by upstream groupby changes (#11755) @rjzamora
  • Fix ORC string sum statistics (#11740) @vuule
  • Add strings_udf package for python 3.9 (#11730) @brandon-b-miller
  • Ensure that all tests launch kernels on cudf's default stream (#11726) @vyasr
  • Don't assume stream is a compile-time constant expression (#11725) @vyasr
  • Fix get_thrust.cmake format at patch command (#11715) @davidwendt
  • Fix cudf::partition* APIs that do not return offsets for empty output table (#11709) @ttnghia
  • Fix cudf::lists::sort_lists for NaN and Infinity values (#11703) @davidwendt
  • Modify ORC reader timestamp parsing to match the apache reader behavior (#11699) @vuule
  • Fix DataFrame.from_arrow to preserve type metadata (#11698) @galipremsagar
  • Fix compile error due to missing header (#11697) @ttnghia
  • Default to Snappy compression in to_orc when using cuDF or Dask (#11690) @vuule
  • Fix an issue related to MultiIndex when group_keys=True (#11689) @galipremsagar
  • Transfer correct dtype to exploded column (#11687) @wence-
  • Ignore protobuf generated files in mypy checks (#11685) @galipremsagar
  • Maintain the index name after .loc (#11677) @shwina
  • Fix issue with extracting nested column data & dtype preservation (#11671) @galipremsagar
  • Ensure that all cudf tests and benchmarks are conda env aware (#11666) @robertmaynard
  • Update to Thrust 1.17.2 to fix cub ODR issues (#11665) @robertmaynard
  • Fix multi-file remote datasource bug (#11655) @rjzamora
  • Fix invalid regex quantifier check to not include alternation (#11654) @davidwendt
  • Fix bug in device_write(): it uses an incorrect size (#11651) @madsbk
  • fixes overflows in benchmarks (#11649) @elstehle
  • Fix regex negated classes to not automatically include new-lines (#11644) @davidwendt
  • Fix compile error in benchmark nested_json.cpp (#11637) @davidwendt
  • Update zfill to match Python output (#11634) @davidwendt
  • Removed converted type for INT32 and INT64 since they do not convert (#11627) @hyperbolic2346
  • Fix host scalars construction of nested types (#11612) @galipremsagar
  • Fix compile warning in nested_json_gpu.cu (#11607) @davidwendt
  • Change default value of ordered to False in CategoricalDtype (#11604) @galipremsagar
  • Preserve order if necessary when deduping categoricals internally (#11597) @brandon-b-miller
  • Add is_timestamp test for leap second (60) (#11594) @davidwendt
  • Fix an issue with to_arrow when column name type is not a string (#11590) @galipremsagar
  • Fix exception in segmented-reduce benchmark (#11588) @davidwendt
  • Fix encode/decode of negative timestamps in ORC reader/writer (#11586) @vuule
  • Correct distribution data type in quantiles benchmark (#11584) @vuule
  • Fix multibyte_split benchmark for host buffers (#11583) @upsj
  • xfail custreamz display test for now (#11567) @shwina
  • Fix JNI for TableWithMeta to use schema_info instead of column_names (#11566) @jlowe
  • Reduce code duplication for dask & distributed nightly/stable installs (#11565) @galipremsagar
  • Fix groupby failures in dask_cudf CI (#11561) @rjzamora
  • Fix for pivot: error when 'values' is a multicharacter string (#11538) @shaswat-indian
  • find_package(cudf) + arrow9 usable with cudf build directory (#11535) @robertmaynard
  • Fixing crash when writing binary nested data in parquet (#11526) @hyperbolic2346
  • Fix for: error when assigning a value to an empty series (#11523) @shaswat-indian
  • Fix invalid results from conditional-left-anti-join in debug build (#11517) @davidwendt
  • Fix cmake error after upgrading to Arrow 9 (#11513) @ttnghia
  • Fix reverse binary operators acting on a host value and cudf.Scalar (#11512) @bdice
  • Update parquet fuzz tests to drop support for skiprows & num_rows (#11505) @galipremsagar
  • Use rapids-cmake 22.10 best practice for RAPIDS.cmake location (#11493) @robertmaynard
  • Handle some zero-sized corner cases in dlpack interop (#11449) @wence-
  • Return empty dataframe when reading an ORC file using empty columns option (#11446) @vuule
  • libcudf c++ example updated to CPM version 0.35.3 (#11417) @robertmaynard
  • Fix regex quantifier check to include capture groups (#11373) @davidwendt
  • Fix read_text when byte_range is aligned with field (#11371) @upsj
  • Fix to_timestamps truncated subsecond calculation (#11367) @davidwendt
  • column: calculate null_count before release()ing the cudf::column (#11365) @wence-

📖 Documentation

  • Update guide-to-udfs notebook (#11861) @brandon-b-miller
  • Update docstring for cudf.read_text (#11799) @GregoryKimball
  • Add doc section for list & struct handling (#11770) @galipremsagar
  • Document that minimum required CMake version is now 3.23.1 (#11751) @robertmaynard
  • Update libcudf documentation build command in DOCUMENTATION.md (#11735) @davidwendt
  • Add docs for use of string data to DataFrame.apply and Series.apply and update guide to UDFs notebook (#11733) @brandon-b-miller
  • Enable more Pydocstyle rules (#11582) @bdice
  • Remove unused cpp/img folder (#11554) @davidwendt
  • Publish C++ developer docs (#11475) @vyasr
  • Fix a misalignment in cudf.get_dummies docstring (#11443) @galipremsagar
  • Update contributing doc to include links to the developer guides (#11390) @davidwendt
  • Fix table_view_base doxygen format (#11340) @davidwendt
  • Create main developer guide for Python (#11235) @vyasr
  • Add developer documentation for benchmarking (#11122) @vyasr
  • cuDF error handling document (#7917) @isVoid

🚀 New Features

  • Add hasNull statistic reading ability to ORC (#11747) @devavret
  • Add istitle to string UDFs (#11738) @brandon-b-miller
  • JSON Column creation in GPU (#11714) @karthikeyann
  • Adds option to take explicit nested schema for nested JSON reader (#11682) @elstehle
  • Add BGZIP data_chunk_reader (#11652) @upsj
  • Support DECIMAL order-by for RANGE window functions (#11645) @mythrocks
  • changing version of cmake to 3.23.3 (#11619) @hyperbolic2346
  • Generate unique keys table in java JNI contiguousSplitGroups (#11614) @res-life
  • Generic type casting to support the new nested JSON reader (#11613) @elstehle
  • JSON tree traversal (#11610) @karthikeyann
  • Add casting operators to masked UDFs (#11578) @brandon-b-miller
  • Adds type inference and type conversion for leaf-columns to the nested JSON parser (#11574) @elstehle
  • Add strings 'like' function (#11558) @davidwendt
  • Handle hyphen as literal for regex cclass when incomplete range (#11557) @davidwendt
  • Enable ZSTD compression in ORC and Parquet writers (#11551) @vuule
  • Adds support for json lines format to the nested JSON reader (#11534) @elstehle
  • Adding optional parquet reader schema (#11524) @hyperbolic2346
  • Adds GPU implementation of JSON-token-stream to JSON-tree (#11518) @karthikeyann
  • Add gdb pretty-printers for simple types (#11499) @upsj
  • Add create_random_column function to the data generator (#11490) @vuule
  • Add fluent API builder to data_profile (#11479) @vuule
  • Adds Nested Json benchmark (#11466) @karthikeyann
  • Convert thrust::optional usages to std::optional (#11455) @robertmaynard
  • Python API for the future experimental JSON reader (#11426) @vuule
  • Return schema info from JSON reader (#11419) @vuule
  • Add regex ASCII flag support for matching builtin character classes (#11404) @davidwendt
  • Truncate parquet column indexes (#11403) @etseidl
  • Adds the end-to-end JSON parser implementation (#11388) @elstehle
  • Use the new JSON parser when the experimental reader is selected (#11364) @vuule
  • Add placeholder for the experimental JSON reader (#11334) @vuule
  • Add read-only functions on string dtypes to DataFrame.apply and Series.apply (#11319) @brandon-b-miller
  • Added 'crosstab' and 'pivot_table' features (#11314) @shaswat-indian
  • Quickly error out when trying to build with unsupported nvcc versions (#11297) @robertmaynard
  • Adds JSON tokenizer (#11264) @elstehle
  • List lexicographic comparator (#11129) @devavret
  • Add generic type inference for cuIO (#11121) @PointKernel
  • Fully support nested types in cudf::contains (#10656) @ttnghia
  • Support nested types in lists::contains (#10548) @ttnghia

🛠️ Improvements

  • Pin dask and distributed for release (#11822) @galipremsagar
  • Add examples for Nested JSON reader (#11814) @GregoryKimball
  • Support shuffle-based groupby aggregations in dask_cudf (#11800) @rjzamora
  • Update strings udf version updater script (#11772) @galipremsagar
  • Remove kwargs in read_csv & to_csv (#11762) @galipremsagar
  • Pass dtype param to avoid pd.Series warnings (#11761) @galipremsagar
  • Enable schema_element & keep_quotes support in json reader (#11746) @galipremsagar
  • Add ability to construct ListColumn when size is None (#11745) @galipremsagar
  • Reduces memory requirements in JSON parser and adds bytes/s and peak memory usage to benchmarks (#11732) @elstehle
  • Add missing copyright headers. (#11712) @bdice
  • Fix copyright check issues in pre-commit (#11711) @bdice
  • Include decimal in supported types for range window order-by columns (#11710) @mythrocks
  • Disable very large column gtest for contiguous-split (#11706) @davidwendt
  • Drop split_out=None test from groupby.agg (#11704) @wence-
  • Use CubinLinker for CUDA Minor Version Compatibility (#11701) @gmarkall
  • Add regex capture-group parameter to auto convert to non-capture groups (#11695) @davidwendt
  • Add a __dataframe__ method to the protocol dataframe object (#11692) @rgommers
  • Special-case multibyte_split for single-byte delimiter (#11681) @upsj
  • Remove isort exclusions (#11680) @bdice
  • Refactor CSV reader benchmarks with nvbench (#11678) @PointKernel
  • Check conda recipe headers with pre-commit (#11669) @bdice
  • Remove redundant style check for clang-format. (#11668) @bdice
  • Add support for group_keys in groupby (#11659) @galipremsagar
  • Fix pandoc pinning. (#11658) @bdice
  • Revert removal of skip_rows / num_rows options from the Parquet reader. (#11657) @nvdbaranec
  • Update git metadata (#11647) @bdice
  • Call set_null_count on a returning column if null-count is known (#11646) @davidwendt
  • Fix some libcudf detail calls not passing the stream variable (#11642) @davidwendt
  • Update to mypy 0.971 (#11640) @wence-
  • Refactor strings strip functor to details header (#11635) @davidwendt
  • Fix incorrect nullCount in get_json_object (#11633) @trxcllnt
  • Simplify hostdevice_vector (#11631) @upsj
  • Refactor parquet writer benchmarks with nvbench (#11623) @PointKernel
  • Rework contains_scalar to check nulls at runtime (#11622) @davidwendt
  • Fix incorrect memory resource used in rolling temp columns (#11618) @mythrocks
  • Upgrade pandas to 1.5 (#11617) @galipremsagar
  • Move type-dispatcher calls from traits.hpp to traits.cpp (#11616) @davidwendt
  • Refactor parquet reader benchmarks with nvbench (#11611) @PointKernel
  • Forward-merge branch-22.08 to branch-22.10 (#11608) @bdice
  • Use stream in Java API. (#11601) @bdice
  • Refactors of public/detail APIs, CUDF_FUNC_RANGE, stream handling. (#11600) @bdice
  • Improve ORC writer benchmark with nvbench (#11598) @PointKernel
  • Tune multibyte_split kernel (#11587) @upsj
  • Move split_utils.cuh to strings/detail (#11585) @davidwendt
  • Fix warnings due to compiler regression with if constexpr (#11581) @ttnghia
  • Add full 24-bit dictionary support to Parquet writer (#11580) @etseidl
  • Expose "explicit-comms" option in shuffle-based dask_cudf functions (#11576) @rjzamora
  • Move cudf::strings::findall_record to cudf::strings::findall (#11575) @davidwendt
  • Refactor dask_cudf groupby to use apply_concat_apply (#11571) @rjzamora
  • Add ability to write list(struct) columns as map type in orc writer (#11568) @galipremsagar
  • Add byte_range to multibyte_split benchmark + NVBench refactor (#11562) @upsj
  • JNI support for writing binary columns in parquet (#11556) @revans2
  • Support additional dictionary bit widths in Parquet writer (#11547) @etseidl
  • Refactor string/numeric conversion utilities (#11545) @davidwendt
  • Removing unnecessary asserts in parquet tests (#11544) @hyperbolic2346
  • Clean up ORC reader benchmarks with NVBench (#11543) @PointKernel
  • Reuse MurmurHash3_32 in Parquet page data. (#11528) @bdice
  • Add hexadecimal value separators (#11527) @bdice
  • Deprecate skiprows and num_rows in read_orc (#11522) @galipremsagar
  • Struct support for NULL_EQUALS binary operation (#11520) @rwlee
  • Bump hadoop-common from 3.2.3 to 3.2.4 in /java (#11516) @dependabot[bot]
  • Fix Feather test warning. (#11511) @bdice
  • copy_range ballot_syncs to have no execution dependency (#11508) @robertmaynard
  • Upgrade to arrow-9.x (#11507) @galipremsagar
  • Remove support for skip_rows / num_rows options in the parquet reader. (#11503) @nvdbaranec
  • Single-pass multibyte_split (#11500) @upsj
  • Sanitize percentile_approx() output for empty input (#11498) @SrikarVanavasam
  • Unpin dask and distributed for development (#11492) @galipremsagar
  • Move SparkMurmurHash3_32 functor. (#11489) @bdice
  • Refactor group_nunique.cu to use nullate::DYNAMIC for reduce-by-key functor (#11482) @davidwendt
  • Drop support for skiprows and num_rows in cudf.read_parquet (#11480) @galipremsagar
  • Add reduction distinct_count benchmark (#11473) @ttnghia
  • Add groupby nunique aggregation benchmark (#11472) @ttnghia
  • Disable Arrow S3 support by default. (#11470) @bdice
  • Add groupby max aggregation benchmark (#11464) @ttnghia
  • Extract Dremel encoding code from Parquet (#11461) @vyasr
  • Add missing Thrust #includes. (#11457) @bdice
  • Make CMake hooks verbose (#11456) @vyasr
  • Control Parquet page size through Python API (#11454) @etseidl
  • Add control of Parquet column index creation to python (#11453) @etseidl
  • Remove unused is_struct trait. (#11450) @bdice
  • Refactor the Buffer class (#11447) @madsbk
  • Refactor pad_side and strip_type enums into side_type enum (#11438) @davidwendt
  • Update to Thrust 1.17.0 (#11437) @bdice
  • Add in JNI for parsing JSON data and getting the metadata back too. (#11431) @revans2
  • Convert byte_array_view to use std::byte (#11424) @hyperbolic2346
  • Deprecate unflatten_nested_columns (#11421) @SrikarVanavasam
  • Remove HASH_SERIAL_MURMUR3 / serial32BitMurmurHash3 (#11383) @bdice
  • Add Spark list hashing Java tests (#11379) @bdice
  • Move cmake to the build section. (#11376) @vyasr
  • Remove use of CUDA driver API calls from libcudf (#11370) @shwina
  • Add column constructor from device_uvector&& (#11356) @SrikarVanavasam
  • Remove unused custreamz thirdparty directory (#11343) @vyasr
  • Update jni version to 22.10.0-SNAPSHOT (#11338) @pxLi
  • Enable using upstream jitify2 (#11287) @shwina
  • Cache cudf.Scalar (#11246) @shwina
  • Remove deprecated Series.applymap. (#11031) @bdice
  • Remove deprecated expand parameter from str.findall. (#11030) @bdice

- C++
Published by GPUtester over 3 years ago

https://github.com/rapidsai/cudf - v22.08.01

🚨 Breaking Changes

  • Pin numpy to <1.23 (#11824) @galipremsagar
  • Remove legacy join APIs (#11274) @vyasr
  • Remove lists::drop_list_duplicates (#11236) @ttnghia
  • Remove Index.replace API (#11131) @vyasr
  • Remove deprecated Index methods from Frame (#11073) @vyasr
  • Remove public API of cudf.merge_sorted. (#11032) @bdice
  • Drop python 3.7 in code-base (#11029) @galipremsagar
  • Return empty dataframe when reading a Parquet file using empty columns option (#11018) @vuule
  • Remove Arrow CUDA IPC code (#10995) @shwina
  • Buffer: make .ptr read-only (#10872) @madsbk

🐛 Bug Fixes

  • Fix out-of-bound access in cudf::detail::label_segments (#11497) @ttnghia
  • Fix distributed error related to loop_in_thread (#11428) @galipremsagar
  • Fix atomic operations on NaN values (#11420) @ttnghia
  • Relax arrow pinning to just 8.x and remove cuda build dependency from cudf recipe (#11412) @kkraus14
  • Revert "Allow CuPy 11" (#11409) @jakirkham
  • Fix moto timeouts (#11369) @galipremsagar
  • Set +/-infinity as the identity values for floating-point numbers in device operators min and max (#11357) @ttnghia
  • Fix memory_usage() for ListSeries (#11355) @thomcom
  • Fix constructing Column from column_view with expired mask (#11354) @shwina
  • Handle parquet corner case: Columns with more rows than are in the row group. (#11353) @nvdbaranec
  • Fix DatetimeIndex & TimedeltaIndex constructors (#11342) @galipremsagar
  • Fix unsigned-compare compile warning in IntPow binops (#11339) @davidwendt
  • Fix performance issue and add a new code path to cudf::detail::contains (#11330) @ttnghia
  • Pin pytorch to temporarily unblock from libcupti errors (#11289) @galipremsagar
  • Workaround for nvcomp zstd overwriting blocks for orc due to underestimate of sizes (#11288) @jbrennan333
  • Fix inconsistency when hashing two tables in cudf::detail::contains (#11284) @ttnghia
  • Fix issue related to numpy array and category dtype (#11282) @galipremsagar
  • Add NotImplementedError when on is specified in DataFrame.join. (#11275) @vyasr
  • Fix invalid allocate_like() and empty_like() tests. (#11268) @nvdbaranec
  • Returns DataFrame When Concating Along Axis 1 (#11263) @isVoid
  • Fix compile error due to missing header (#11257) @ttnghia
  • Fix a memory aliasing/crash issue in scatter for lists. (#11254) @nvdbaranec
  • Fix tests/rolling/empty_input_test (#11238) @ttnghia
  • Fix const qualifier when using host_span<bitmask_type const*> (#11220) @ttnghia
  • Avoid using nvcompBatchedDeflateDecompressGetTempSizeEx in cuIO (#11213) @vuule
  • Generate benchmark data with correct run length regardless of cardinality (#11205) @vuule
  • Fix cumulative count index behavior (#11188) @brandon-b-miller
  • Fix assertion in dask_cudf test_struct_explode (#11170) @rjzamora
  • Provides a method for the user to remove the hook and re-register the hook in a custom shutdown hook manager (#11161) @res-life
  • Fix compatibility issues with pandas 1.4.3 (#11152) @vyasr
  • Ensure cuco export set is installed in cmake build (#11147) @jlowe
  • Avoid redundant deepcopy in cudf.from_pandas (#11142) @galipremsagar
  • Fix compile error due to missing header (#11126) @ttnghia
  • Fix __cuda_array_interface__ failures (#11113) @galipremsagar
  • Support octal and hex within regex character class pattern (#11112) @davidwendt
  • Fix split_re matching logic for word boundaries (#11106) @davidwendt
  • Handle multiple files metadata in read_parquet (#11105) @galipremsagar
  • Fix index alignment for Series objects with repeated index (#11103) @shwina
  • FindcuFile now searches in the current CUDA Toolkit location (#11101) @robertmaynard
  • Fix regex word boundary logic to include underline (#11099) @davidwendt
  • Exclude CudaFatalTest when selecting all Java tests (#11083) @jlowe
  • Fix duplicate cudatoolkit pinning issue (#11070) @galipremsagar
  • Maintain the input index in the result of a groupby-transform (#11068) @shwina
  • Fix bug with row count comparison for expect_columns_equivalent(). (#11059) @nvdbaranec
  • Fix BPE uninitialized size value for null and empty input strings (#11054) @davidwendt
  • Include missing header for usage of get_current_device_resource() (#11047) @AtlantaPepsi
  • Fix warn_unused_result error in parquet test (#11026) @karthikeyann
  • Return empty dataframe when reading a Parquet file using empty columns option (#11018) @vuule
  • Fix small error in page row count limiting (#10991) @etseidl
  • Fix a row index entry error in ORC writer issue (#10989) @vuule
  • Fix grouped covariance to require both values to be convertible to double. (#10891) @bdice

📖 Documentation

  • Defer loading of custom.js (#11465) @galipremsagar
  • Fix issues with day & night modes in python docs (#11400) @galipremsagar
  • Update missing data handling APIs in docs (#11345) @galipremsagar
  • Add lists filtering APIs to doxygen group. (#11336) @bdice
  • Remove unused import in README sample (#11318) @vyasr
  • Note null behavior in where docs (#11276) @brandon-b-miller
  • Update docstring for spans in get_row_data_range (#11271) @vyasr
  • Update nvCOMP integration table (#11231) @vuule
  • Add dev docs for documentation writing (#11217) @vyasr
  • Documentation fix for concatenate (#11187) @dagardner-nv
  • Fix unresolved links in markdown (#11173) @karthikeyann
  • Fix cudf version in README.md install commands (#11164) @jvanstraten
  • Switch language from None to "en" in docs build (#11133) @galipremsagar
  • Remove docs mentioning scalar_view since no such class exists. (#11132) @bdice
  • Add docstring entry for DataFrame.value_counts (#11039) @galipremsagar
  • Add docs to rolling var, std, count. (#11035) @bdice
  • Fix docs for Numba UDFs. (#11020) @bdice
  • Replace column comparison utilities functions with macros (#11007) @karthikeyann
  • Fix Doxygen warnings in multiple headers files (#11003) @karthikeyann
  • Fix doxygen warnings in utilities/ headers (#10974) @karthikeyann
  • Fix Doxygen warnings in table header files (#10964) @karthikeyann
  • Fix Doxygen warnings in column header files (#10963) @karthikeyann
  • Fix Doxygen warnings in strings / header files (#10937) @karthikeyann
  • Generate Doxygen Tag File for Libcudf (#10932) @isVoid
  • Fix doxygen warnings in structs, lists headers (#10923) @karthikeyann
  • Fix doxygen warnings in fixed_point.hpp (#10922) @karthikeyann
  • Fix doxygen warnings in ast/, rolling, tdigest/, wrappers/, dictionary/ headers (#10921) @karthikeyann
  • fix doxygen warnings in cudf/io/types.hpp, other header files (#10913) @karthikeyann
  • fix doxygen warnings in cudf/io/ avro, csv, json, orc, parquet header files (#10912) @karthikeyann
  • Fix doxygen warnings in cudf/*.hpp (#10896) @karthikeyann
  • Add missing documentation in aggregation.hpp (#10887) @karthikeyann
  • Revise PR template. (#10774) @bdice

🚀 New Features

  • Change cmake to allow controlling Arrow version via cmake variable (#11429) @kkraus14
  • Adding support for list<int8> columns to be written as byte arrays in parquet (#11328) @hyperbolic2346
  • Adding byte array view structure (#11322) @hyperbolic2346
  • Adding byte_array statistics (#11303) @hyperbolic2346
  • Add column indexes to Parquet writer (#11302) @etseidl
  • Provide an Option for Default Integer and Floating Bitwidth (#11272) @isVoid
  • FST benchmark (#11243) @karthikeyann
  • Adds the Finite-State Transducer algorithm (#11242) @elstehle
  • Refactor collect_set to use cudf::distinct and cudf::lists::distinct (#11228) @ttnghia
  • Treat zstd as stable in nvcomp releases 2.3.2 and later (#11226) @jbrennan333
  • Add 24 bit dictionary support to Parquet writer (#11216) @devavret
  • Enable positive group indices for extractAllRecord on JNI (#11215) @anthony-chang
  • JNI bindings for NTH_ELEMENT window aggregation (#11201) @mythrocks
  • Add JNI bindings for extractAllRecord (#11196) @anthony-chang
  • Add cudf.options (#11193) @isVoid
  • Add thrift support for parquet column and offset indexes (#11178) @etseidl
  • Adding binary read/write as options for parquet (#11160) @hyperbolic2346
  • Support nth_element for window functions (#11158) @mythrocks
  • Implement lists::distinct and cudf::detail::stable_distinct (#11149) @ttnghia
  • Implement Groupby pct_change (#11144) @skirui-source
  • Add JNI for set operations (#11143) @ttnghia
  • Remove deprecated PER_THREAD_DEFAULT_STREAM (#11134) @jbrennan333
  • Added a Java method to check the existence of a list of keys in a map (#11128) @razajafri
  • Feature/python benchmarking (#11125) @vyasr
  • Support nan_equality in cudf::distinct (#11118) @ttnghia
  • Added JNI for getMapValueForKeys (#11104) @razajafri
  • Refactor semi_anti_join (#11100) @ttnghia
  • Replace remaining instances of rmm::cuda_stream_default with cudf::default_stream_value (#11082) @jbrennan333
  • Adds the Logical Stack algorithm (#11078) @elstehle
  • Add doxygen-check pre-commit hook (#11076) @karthikeyann
  • Use new nvCOMP API to optimize the decompression temp memory size (#11064) @vuule
  • Add Doxygen CI check (#11057) @karthikeyann
  • Support duplicate_keep_option in cudf::distinct (#11052) @ttnghia
  • Support set operations (#11043) @ttnghia
  • Support for ZLIB compression in ORC writer (#11036) @vuule
  • Adding feature swaplevels (#11027) @VamsiTallam95
  • Use nvCOMP for ZLIB decompression in ORC reader (#11024) @vuule
  • Function for bfill, ffill #9591 (#11022) @Sreekiran096
  • Generate group offsets from element labels (#11017) @ttnghia
  • Feature axes (#10979) @VamsiTallam95
  • Generate group labels from offsets (#10945) @ttnghia
  • Add missing cuIO benchmark coverage for duration types (#10933) @vuule
  • Dask-cuDF cumulative groupby ops (#10889) @brandon-b-miller
  • Reindex Improvements (#10815) @brandon-b-miller
  • Implement value_counts for DataFrame (#10813) @martinfalisse

🛠️ Improvements

  • Pin numpy to <1.23 (#11824) @galipremsagar
  • Make Index Join Tests on Default Precisions Deterministic (#11451) @isVoid
  • Pin dask & distributed for release (#11433) @galipremsagar
  • Use documented header template for doxygen (#11430) @galipremsagar
  • Relax arrow version in dev env (#11418) @galipremsagar
  • Added Java bindings for Parquet options for binary read (#11410) @razajafri
  • Allow CuPy 11 (#11393) @jakirkham
  • Improve multibyte_split performance (#11347) @cwharris
  • Switch death test to use explicit trap. (#11326) @vyasr
  • Add --output-on-failure to ctest args. (#11321) @vyasr
  • Consolidate remaining DataFrame/Series APIs (#11315) @vyasr
  • Add JNI support for the join_strings API (#11309) @revans2
  • Add cupy version to setup.py install_requires (#11306) @vyasr
  • removing some unused code (#11305) @hyperbolic2346
  • Add test of wildcard selection (#11300) @vyasr
  • Update parquet reader to take stream parameter (#11294) @PointKernel
  • Spark list hashing (#11292) @bdice
  • Remove legacy join APIs (#11274) @vyasr
  • Fix cudf recipes syntax (#11273) @ajschmidt8
  • Fix cudf recipe (#11267) @ajschmidt8
  • Cleanup config files (#11266) @vyasr
  • Run mypy on all packages (#11265) @vyasr
  • Update to isort 5.10.1. (#11262) @vyasr
  • Consolidate flake8 and pydocstyle configuration (#11260) @vyasr
  • Remove redundant black config specifications. (#11258) @vyasr
  • Ensure DeprecationWarnings are not introduced via pre-commit (#11255) @wence-
  • Optimization to gpu::PreprocessColumnData in parquet reader. (#11252) @nvdbaranec
  • Move rolling impl details to detail/ directory. (#11250) @mythrocks
  • Remove lists::drop_list_duplicates (#11236) @ttnghia
  • Use cudf::lists::distinct in Python binding (#11234) @ttnghia
  • Use cudf::lists::distinct in Java binding (#11233) @ttnghia
  • Use cudf::distinct in Java binding (#11232) @ttnghia
  • Pin dask-cuda in dev environment (#11229) @galipremsagar
  • Remove cruft in map_lookup (#11221) @mythrocks
  • Deprecate skiprows & num_rows in parquet reader (#11218) @galipremsagar
  • Remove Frame._index (#11210) @vyasr
  • Improve performance for cudf::contains when searching for a scalar (#11202) @ttnghia
  • Document why Development component is needed for CMake. (#11200) @vyasr
  • cleanup unused code in rolling_test.hpp (#11195) @karthikeyann
  • Standardize join internals around DataFrame (#11184) @vyasr
  • Move character case table declarations from src to detail (#11183) @davidwendt
  • Remove usage of Frame in StringMethods (#11181) @vyasr
  • Expose get_json_object_options to Python (#11180) @SrikarVanavasam
  • Fix decimal128 stats in parquet writer (#11179) @etseidl
  • Modify CheckPageRows in parquet_test to use datasources (#11177) @etseidl
  • Pin max version of cuda-python to 11.7.0 (#11174) @Ethyling
  • Refactor and optimize Frame.where (#11168) @vyasr
  • Add npos const static member to cudf::string_view (#11166) @davidwendt
  • Move _drop_rows_by_label from Frame to IndexedFrame (#11157) @vyasr
  • Clean up _copy_type_metadata (#11156) @vyasr
  • Add nvcc conda package in dev environment (#11154) @galipremsagar
  • Struct binary comparison op functionality for spark rapids (#11153) @rwlee
  • Refactor inline conditionals. (#11151) @bdice
  • Refactor Spark hashing tests (#11145) @bdice
  • Add new _from_data_like_self factory (#11140) @vyasr
  • Update get_cucollections to use rapids-cmake (#11139) @vyasr
  • Remove unnecessary extra function for libcudacxx detection (#11138) @vyasr
  • Allow initial value for cudf::reduce and cudf::segmented_reduce. (#11137) @SrikarVanavasam
  • Remove Index.replace API (#11131) @vyasr
  • Move char-type table function declarations from src to detail (#11127) @davidwendt
  • Clean up repo root (#11124) @bdice
  • Improve print formatting of strings containing newline characters. (#11108) @nvdbaranec
  • Fix cudf::string_view::find() to return pos for empty string argument (#11107) @davidwendt
  • Forward-merge branch-22.06 to branch-22.08 (#11086) @bdice
  • Take iterators by value in clamp.cu. (#11084) @bdice
  • Performance improvements for row to column conversions (#11075) @hyperbolic2346
  • Remove deprecated Index methods from Frame (#11073) @vyasr
  • Use per-page max compressed size estimate for compression (#11066) @devavret
  • column to row refactor for performance (#11063) @hyperbolic2346
  • Include skbuild directory into build.sh clean operation (#11060) @galipremsagar
  • Unpin dask & distributed for development (#11058) @galipremsagar
  • Add support for Series.between (#11051) @galipremsagar
  • Fix groupby include (#11046) @bwyogatama
  • Regex cleanup internal reclass and reclass_device classes (#11045) @davidwendt
  • Remove public API of cudf.merge_sorted. (#11032) @bdice
  • Drop python 3.7 in code-base (#11029) @galipremsagar
  • Addition & integration of the integer power operator (#11025) @AtlantaPepsi
  • Refactor lists::contains (#11019) @ttnghia
  • Change build.sh to find C++ library by default and avoid shadowing CMAKE_ARGS (#11013) @vyasr
  • Clean up parquet unit test (#11005) @PointKernel
  • Add missing #pragma once to header files (#11004) @karthikeyann
  • Cleanup iterator.cuh and add fixed point support for scalar_optional_accessor (#10999) @ttnghia
  • Refactor cudf::contains (#10997) @ttnghia
  • Remove Arrow CUDA IPC code (#10995) @shwina
  • Change file extension for groupby benchmark (#10985) @ttnghia
  • Sort recipe include checks. (#10984) @bdice
  • Update cuCollections for thrust upgrade (#10983) @PointKernel
  • Expose row-group size options in cudf ParquetWriter (#10980) @rjzamora
  • Cleanup cudf::strings::detail::regex_parser class source (#10975) @davidwendt
  • Handle missing fields as nulls in get_json_object() (#10970) @SrikarVanavasam
  • Fix license families to match all-caps expected by conda-verify. (#10931) @bdice
  • Include <optional> for GCC 11 compatibility. (#10927) @bdice
  • Enable builds with scikit-build (#10919) @vyasr
  • Improve distinct by using cuco::static_map::retrieve_all (#10916) @PointKernel
  • update cudfjni to 22.08.0-SNAPSHOT (#10910) @pxLi
  • Improve the capture of fatal cuda error (#10884) @sperlingxx
  • Cleanup regex compiler operators and operands source (#10879) @davidwendt
  • Buffer: make .ptr read-only (#10872) @madsbk
  • Configurable NaN handling in device_row_comparators (#10870) @rwlee
  • Register cudf.core.groupby.Grouper objects to dask grouper_dispatch (#10838) @brandon-b-miller
  • Upgrade to arrow-8 (#10816) @galipremsagar
  • Remove __getattr__ method in RangeIndex class (#10538) @skirui-source
  • Adding bins to value counts (#8247) @marlenezw

- C++
Published by GPUtester over 3 years ago

https://github.com/rapidsai/cudf - v22.08.00

🚨 Breaking Changes

  • Remove legacy join APIs (#11274) @vyasr
  • Remove lists::drop_list_duplicates (#11236) @ttnghia
  • Remove Index.replace API (#11131) @vyasr
  • Remove deprecated Index methods from Frame (#11073) @vyasr
  • Remove public API of cudf.merge_sorted. (#11032) @bdice
  • Drop python 3.7 in code-base (#11029) @galipremsagar
  • Return empty dataframe when reading a Parquet file using empty columns option (#11018) @vuule
  • Remove Arrow CUDA IPC code (#10995) @shwina
  • Buffer: make .ptr read-only (#10872) @madsbk

🐛 Bug Fixes

  • Fix distributed error related to loop_in_thread (#11428) @galipremsagar
  • Relax arrow pinning to just 8.x and remove cuda build dependency from cudf recipe (#11412) @kkraus14
  • Revert "Allow CuPy 11" (#11409) @jakirkham
  • Fix moto timeouts (#11369) @galipremsagar
  • Set +/-infinity as the identity values for floating-point numbers in device operators min and max (#11357) @ttnghia
  • Fix memory_usage() for ListSeries (#11355) @thomcom
  • Fix constructing Column from column_view with expired mask (#11354) @shwina
  • Handle parquet corner case: Columns with more rows than are in the row group. (#11353) @nvdbaranec
  • Fix DatetimeIndex & TimedeltaIndex constructors (#11342) @galipremsagar
  • Fix unsigned-compare compile warning in IntPow binops (#11339) @davidwendt
  • Fix performance issue and add a new code path to cudf::detail::contains (#11330) @ttnghia
  • Pin pytorch to temporarily unblock from libcupti errors (#11289) @galipremsagar
  • Workaround for nvcomp zstd overwriting blocks for orc due to underestimate of sizes (#11288) @jbrennan333
  • Fix inconsistency when hashing two tables in cudf::detail::contains (#11284) @ttnghia
  • Fix issue related to numpy array and category dtype (#11282) @galipremsagar
  • Add NotImplementedError when on is specified in DataFrame.join. (#11275) @vyasr
  • Fix invalid allocate_like() and empty_like() tests. (#11268) @nvdbaranec
  • Returns DataFrame When Concating Along Axis 1 (#11263) @isVoid
  • Fix compile error due to missing header (#11257) @ttnghia
  • Fix a memory aliasing/crash issue in scatter for lists. (#11254) @nvdbaranec
  • Fix tests/rolling/empty_input_test (#11238) @ttnghia
  • Fix const qualifier when using host_span<bitmask_type const*> (#11220) @ttnghia
  • Avoid using nvcompBatchedDeflateDecompressGetTempSizeEx in cuIO (#11213) @vuule
  • Generate benchmark data with correct run length regardless of cardinality (#11205) @vuule
  • Fix cumulative count index behavior (#11188) @brandon-b-miller
  • Fix assertion in dask_cudf test_struct_explode (#11170) @rjzamora
  • Provides a method for the user to remove the hook and re-register the hook in a custom shutdown hook manager (#11161) @res-life
  • Fix compatibility issues with pandas 1.4.3 (#11152) @vyasr
  • Ensure cuco export set is installed in cmake build (#11147) @jlowe
  • Avoid redundant deepcopy in cudf.from_pandas (#11142) @galipremsagar
  • Fix compile error due to missing header (#11126) @ttnghia
  • Fix __cuda_array_interface__ failures (#11113) @galipremsagar
  • Support octal and hex within regex character class pattern (#11112) @davidwendt
  • Fix split_re matching logic for word boundaries (#11106) @davidwendt
  • Handle multiple files metadata in read_parquet (#11105) @galipremsagar
  • Fix index alignment for Series objects with repeated index (#11103) @shwina
  • FindcuFile now searches in the current CUDA Toolkit location (#11101) @robertmaynard
  • Fix regex word boundary logic to include underline (#11099) @davidwendt
  • Exclude CudaFatalTest when selecting all Java tests (#11083) @jlowe
  • Fix duplicate cudatoolkit pinning issue (#11070) @galipremsagar
  • Maintain the input index in the result of a groupby-transform (#11068) @shwina
  • Fix bug with row count comparison for expect_columns_equivalent(). (#11059) @nvdbaranec
  • Fix BPE uninitialized size value for null and empty input strings (#11054) @davidwendt
  • Include missing header for usage of get_current_device_resource() (#11047) @AtlantaPepsi
  • Fix warn_unused_result error in parquet test (#11026) @karthikeyann
  • Return empty dataframe when reading a Parquet file using empty columns option (#11018) @vuule
  • Fix small error in page row count limiting (#10991) @etseidl
  • Fix a row index entry error in ORC writer issue (#10989) @vuule
  • Fix grouped covariance to require both values to be convertible to double. (#10891) @bdice

📖 Documentation

  • Fix issues with day & night modes in python docs (#11400) @galipremsagar
  • Update missing data handling APIs in docs (#11345) @galipremsagar
  • Add lists filtering APIs to doxygen group. (#11336) @bdice
  • Remove unused import in README sample (#11318) @vyasr
  • Note null behavior in where docs (#11276) @brandon-b-miller
  • Update docstring for spans in get_row_data_range (#11271) @vyasr
  • Update nvCOMP integration table (#11231) @vuule
  • Add dev docs for documentation writing (#11217) @vyasr
  • Documentation fix for concatenate (#11187) @dagardner-nv
  • Fix unresolved links in markdown (#11173) @karthikeyann
  • Fix cudf version in README.md install commands (#11164) @jvanstraten
  • Switch language from None to "en" in docs build (#11133) @galipremsagar
  • Remove docs mentioning scalar_view since no such class exists. (#11132) @bdice
  • Add docstring entry for DataFrame.value_counts (#11039) @galipremsagar
  • Add docs to rolling var, std, count. (#11035) @bdice
  • Fix docs for Numba UDFs. (#11020) @bdice
  • Replace column comparison utilities functions with macros (#11007) @karthikeyann
  • Fix Doxygen warnings in multiple headers files (#11003) @karthikeyann
  • Fix doxygen warnings in utilities/ headers (#10974) @karthikeyann
  • Fix Doxygen warnings in table header files (#10964) @karthikeyann
  • Fix Doxygen warnings in column header files (#10963) @karthikeyann
  • Fix Doxygen warnings in strings / header files (#10937) @karthikeyann
  • Generate Doxygen Tag File for Libcudf (#10932) @isVoid
  • Fix doxygen warnings in structs, lists headers (#10923) @karthikeyann
  • Fix doxygen warnings in fixed_point.hpp (#10922) @karthikeyann
  • Fix doxygen warnings in ast/, rolling, tdigest/, wrappers/, dictionary/ headers (#10921) @karthikeyann
  • fix doxygen warnings in cudf/io/types.hpp, other header files (#10913) @karthikeyann
  • fix doxygen warnings in cudf/io/ avro, csv, json, orc, parquet header files (#10912) @karthikeyann
  • Fix doxygen warnings in cudf/*.hpp (#10896) @karthikeyann
  • Add missing documentation in aggregation.hpp (#10887) @karthikeyann
  • Revise PR template. (#10774) @bdice

🚀 New Features

  • Change cmake to allow controlling Arrow version via cmake variable (#11429) @kkraus14
  • Adding support for list<int8> columns to be written as byte arrays in parquet (#11328) @hyperbolic2346
  • Adding byte array view structure (#11322) @hyperbolic2346
  • Adding byte_array statistics (#11303) @hyperbolic2346
  • Add column indexes to Parquet writer (#11302) @etseidl
  • Provide an Option for Default Integer and Floating Bitwidth (#11272) @isVoid
  • FST benchmark (#11243) @karthikeyann
  • Adds the Finite-State Transducer algorithm (#11242) @elstehle
  • Refactor collect_set to use cudf::distinct and cudf::lists::distinct (#11228) @ttnghia
  • Treat zstd as stable in nvcomp releases 2.3.2 and later (#11226) @jbrennan333
  • Add 24 bit dictionary support to Parquet writer (#11216) @devavret
  • Enable positive group indices for extractAllRecord on JNI (#11215) @anthony-chang
  • JNI bindings for NTH_ELEMENT window aggregation (#11201) @mythrocks
  • Add JNI bindings for extractAllRecord (#11196) @anthony-chang
  • Add cudf.options (#11193) @isVoid
  • Add thrift support for parquet column and offset indexes (#11178) @etseidl
  • Adding binary read/write as options for parquet (#11160) @hyperbolic2346
  • Support nth_element for window functions (#11158) @mythrocks
  • Implement lists::distinct and cudf::detail::stable_distinct (#11149) @ttnghia
  • Implement Groupby pct_change (#11144) @skirui-source
  • Add JNI for set operations (#11143) @ttnghia
  • Remove deprecated PER_THREAD_DEFAULT_STREAM (#11134) @jbrennan333
  • Added a Java method to check the existence of a list of keys in a map (#11128) @razajafri
  • Feature/python benchmarking (#11125) @vyasr
  • Support nan_equality in cudf::distinct (#11118) @ttnghia
  • Added JNI for getMapValueForKeys (#11104) @razajafri
  • Refactor semi_anti_join (#11100) @ttnghia
  • Replace remaining instances of rmm::cuda_stream_default with cudf::default_stream_value (#11082) @jbrennan333
  • Adds the Logical Stack algorithm (#11078) @elstehle
  • Add doxygen-check pre-commit hook (#11076) @karthikeyann
  • Use new nvCOMP API to optimize the decompression temp memory size (#11064) @vuule
  • Add Doxygen CI check (#11057) @karthikeyann
  • Support duplicate_keep_option in cudf::distinct (#11052) @ttnghia
  • Support set operations (#11043) @ttnghia
  • Support for ZLIB compression in ORC writer (#11036) @vuule
  • Adding feature swaplevels (#11027) @VamsiTallam95
  • Use nvCOMP for ZLIB decompression in ORC reader (#11024) @vuule
  • Function for bfill, ffill #9591 (#11022) @Sreekiran096
  • Generate group offsets from element labels (#11017) @ttnghia
  • Feature axes (#10979) @VamsiTallam95
  • Generate group labels from offsets (#10945) @ttnghia
  • Add missing cuIO benchmark coverage for duration types (#10933) @vuule
  • Dask-cuDF cumulative groupby ops (#10889) @brandon-b-miller
  • Reindex Improvements (#10815) @brandon-b-miller
  • Implement value_counts for DataFrame (#10813) @martinfalisse

🛠️ Improvements

  • Pin dask & distributed for release (#11433) @galipremsagar
  • Use documented header template for doxygen (#11430) @galipremsagar
  • Relax arrow version in dev env (#11418) @galipremsagar
  • Allow CuPy 11 (#11393) @jakirkham
  • Improve multibyte_split performance (#11347) @cwharris
  • Switch death test to use explicit trap. (#11326) @vyasr
  • Add --output-on-failure to ctest args. (#11321) @vyasr
  • Consolidate remaining DataFrame/Series APIs (#11315) @vyasr
  • Add JNI support for the join_strings API (#11309) @revans2
  • Add cupy version to setup.py install_requires (#11306) @vyasr
  • removing some unused code (#11305) @hyperbolic2346
  • Add test of wildcard selection (#11300) @vyasr
  • Update parquet reader to take stream parameter (#11294) @PointKernel
  • Spark list hashing (#11292) @bdice
  • Remove legacy join APIs (#11274) @vyasr
  • Fix cudf recipes syntax (#11273) @ajschmidt8
  • Fix cudf recipe (#11267) @ajschmidt8
  • Cleanup config files (#11266) @vyasr
  • Run mypy on all packages (#11265) @vyasr
  • Update to isort 5.10.1. (#11262) @vyasr
  • Consolidate flake8 and pydocstyle configuration (#11260) @vyasr
  • Remove redundant black config specifications. (#11258) @vyasr
  • Ensure DeprecationWarnings are not introduced via pre-commit (#11255) @wence-
  • Optimization to gpu::PreprocessColumnData in parquet reader. (#11252) @nvdbaranec
  • Move rolling impl details to detail/ directory. (#11250) @mythrocks
  • Remove lists::drop_list_duplicates (#11236) @ttnghia
  • Use cudf::lists::distinct in Python binding (#11234) @ttnghia
  • Use cudf::lists::distinct in Java binding (#11233) @ttnghia
  • Use cudf::distinct in Java binding (#11232) @ttnghia
  • Pin dask-cuda in dev environment (#11229) @galipremsagar
  • Remove cruft in map_lookup (#11221) @mythrocks
  • Deprecate skiprows & num_rows in parquet reader (#11218) @galipremsagar
  • Remove Frame._index (#11210) @vyasr
  • Improve performance for cudf::contains when searching for a scalar (#11202) @ttnghia
  • Document why Development component is needing for CMake. (#11200) @vyasr
  • cleanup unused code in rolling_test.hpp (#11195) @karthikeyann
  • Standardize join internals around DataFrame (#11184) @vyasr
  • Move character case table declarations from src to detail (#11183) @davidwendt
  • Remove usage of Frame in StringMethods (#11181) @vyasr
  • Expose get_json_object_options to Python (#11180) @SrikarVanavasam
  • Fix decimal128 stats in parquet writer (#11179) @etseidl
  • Modify CheckPageRows in parquet_test to use datasources (#11177) @etseidl
  • Pin max version of cuda-python to 11.7.0 (#11174) @Ethyling
  • Refactor and optimize Frame.where (#11168) @vyasr
  • Add npos const static member to cudf::string_view (#11166) @davidwendt
  • Move _drop_rows_by_label from Frame to IndexedFrame (#11157) @vyasr
  • Clean up _copy_type_metadata (#11156) @vyasr
  • Add nvcc conda package in dev environment (#11154) @galipremsagar
  • Struct binary comparison op functionality for spark rapids (#11153) @rwlee
  • Refactor inline conditionals. (#11151) @bdice
  • Refactor Spark hashing tests (#11145) @bdice
  • Add new _from_data_like_self factory (#11140) @vyasr
  • Update get_cucollections to use rapids-cmake (#11139) @vyasr
  • Remove unnecessary extra function for libcudacxx detection (#11138) @vyasr
  • Allow initial value for cudf::reduce and cudf::segmented_reduce. (#11137) @SrikarVanavasam
  • Remove Index.replace API (#11131) @vyasr
  • Move char-type table function declarations from src to detail (#11127) @davidwendt
  • Clean up repo root (#11124) @bdice
  • Improve print formatting of strings containing newline characters. (#11108) @nvdbaranec
  • Fix cudf::string_view::find() to return pos for empty string argument (#11107) @davidwendt
  • Forward-merge branch-22.06 to branch-22.08 (#11086) @bdice
  • Take iterators by value in clamp.cu. (#11084) @bdice
  • Performance improvements for row to column conversions (#11075) @hyperbolic2346
  • Remove deprecated Index methods from Frame (#11073) @vyasr
  • Use per-page max compressed size estimate for compression (#11066) @devavret
  • column to row refactor for performance (#11063) @hyperbolic2346
  • Include skbuild directory into build.sh clean operation (#11060) @galipremsagar
  • Unpin dask & distributed for development (#11058) @galipremsagar
  • Add support for Series.between (#11051) @galipremsagar
  • Fix groupby include (#11046) @bwyogatama
  • Regex cleanup internal reclass and reclass_device classes (#11045) @davidwendt
  • Remove public API of cudf.merge_sorted. (#11032) @bdice
  • Drop python 3.7 in code-base (#11029) @galipremsagar
  • Addition & integration of the integer power operator (#11025) @AtlantaPepsi
  • Refactor lists::contains (#11019) @ttnghia
  • Change build.sh to find C++ library by default and avoid shadowing CMAKE_ARGS (#11013) @vyasr
  • Clean up parquet unit test (#11005) @PointKernel
  • Add missing #pragma once to header files (#11004) @karthikeyann
  • Cleanup iterator.cuh and add fixed point support for scalar_optional_accessor (#10999) @ttnghia
  • Refactor cudf::contains (#10997) @ttnghia
  • Remove Arrow CUDA IPC code (#10995) @shwina
  • Change file extension for groupby benchmark (#10985) @ttnghia
  • Sort recipe include checks. (#10984) @bdice
  • Update cuCollections for thrust upgrade (#10983) @PointKernel
  • Expose row-group size options in cudf ParquetWriter (#10980) @rjzamora
  • Cleanup cudf::strings::detail::regex_parser class source (#10975) @davidwendt
  • Handle missing fields as nulls in get_json_object() (#10970) @SrikarVanavasam
  • Fix license families to match all-caps expected by conda-verify. (#10931) @bdice
  • Include <optional> for GCC 11 compatibility. (#10927) @bdice
  • Enable builds with scikit-build (#10919) @vyasr
  • Improve distinct by using cuco::static_map::retrieve_all (#10916) @PointKernel
  • update cudfjni to 22.08.0-SNAPSHOT (#10910) @pxLi
  • Improve the capture of fatal cuda error (#10884) @sperlingxx
  • Cleanup regex compiler operators and operands source (#10879) @davidwendt
  • Buffer: make .ptr read-only (#10872) @madsbk
  • Configurable NaN handling in device_row_comparators (#10870) @rwlee
  • Register cudf.core.groupby.Grouper objects to dask grouper_dispatch (#10838) @brandon-b-miller
  • Upgrade to arrow-8 (#10816) @galipremsagar
  • Remove __getattr__ method in RangeIndex class (#10538) @skirui-source
  • Adding bins to value counts (#8247) @marlenezw

- C++
Published by GPUtester over 3 years ago

https://github.com/rapidsai/cudf - v22.06.01

v22.06.01

- C++
Published by GPUtester over 3 years ago

https://github.com/rapidsai/cudf - v22.06.00

🚨 Breaking Changes

  • Enable Zstandard decompression only when all nvcomp integrations are enabled (#10944) @vuule
  • Rename sliced_child to get_sliced_child. (#10885) @bdice
  • Add parameters to control page size in Parquet writer (#10882) @etseidl
  • Make cudf::test::expect_columns_equal() to fail when comparing unsanitary lists. (#10880) @nvdbaranec
  • Cleanup regex compiler fixed quantifiers source (#10843) @davidwendt
  • Refactor cudf::contains, renaming and switching parameters role (#10802) @ttnghia
  • Generic serialization of all column types (#10784) @wence-
  • Return per-file metadata from readers (#10782) @vuule
  • HostColumnVectoreCore#isNull should return true for out-of-range rows (#10779) @gerashegalov
  • Update groupby::hash to use new row operators for keys (#10770) @PointKernel
  • update mangle_dupe_cols behavior in csv reader to match pandas 1.4.0 behavior (#10749) @karthikeyann
  • Rename CUDA_TRY macro to CUDF_CUDA_TRY, rename CHECK_CUDA macro to CUDF_CHECK_CUDA. (#10589) @bdice
  • Upgrade cudf to support pandas 1.4.x versions (#10584) @galipremsagar
  • Move binop methods from Frame to IndexedFrame and standardize the docstring (#10576) @vyasr
  • Add default= kwarg to .list.get() accessor method (#10547) @shwina
  • Remove deprecated decimal_cols_as_float in the ORC reader (#10515) @vuule
  • Support nvComp 2.3 if local, otherwise use nvcomp 2.2 (#10513) @robertmaynard
  • Fix findall_record to return empty list for no matches (#10491) @davidwendt
  • Namespace/Docstring Fixes for Reduction (#10471) @isVoid
  • Additional refactoring of hash functions (#10462) @bdice
  • Fix default value of str.split expand parameter. (#10457) @bdice
  • Remove deprecated code. (#10450) @vyasr

🐛 Bug Fixes

  • Fix single column MultiIndex issue in sort_index (#10957) @galipremsagar
  • Make SerializedTableHeader(numRows) public (#10949) @gerashegalov
  • Fix gcc_linux version pinning in dev environment (#10943) @galipremsagar
  • Fix an issue with reading raw string in cudf.read_json (#10924) @galipremsagar
  • Make cudf::test::expect_columns_equal() to fail when comparing unsanitary lists. (#10880) @nvdbaranec
  • Fix segmented_reduce on empty column with non-empty offsets (#10876) @davidwendt
  • Fix dask-cudf groupby handling when grouping by all columns (#10866) @charlesbluca
  • Fix a bug in distinct: using nested nulls logic (#10848) @PointKernel
  • Fix constness / references in weak ordering operator() signatures. (#10846) @bdice
  • Suppress sizeof-array-div warnings in thrust found by gcc-11 (#10840) @robertmaynard
  • Add handling for string by-columns in dask-cudf groupby (#10830) @charlesbluca
  • Fix compile warning in search.cu (#10827) @davidwendt
  • Fix element access const correctness in hostdevice_vector (#10804) @vuule
  • Update cuco git tag (#10788) @PointKernel
  • HostColumnVectoreCore#isNull should return true for out-of-range rows (#10779) @gerashegalov
  • Fixing deprecation warnings in test_orc.py (#10772) @hyperbolic2346
  • Enable writing to s3 storage in chunked parquet writer (#10769) @galipremsagar
  • Fix construction of nested structs with EMPTY child (#10761) @shwina
  • Fix replace error when regex has only zero match quantifiers (#10760) @davidwendt
  • Fix an issue with one_level_list schemas in parquet reader. (#10750) @nvdbaranec
  • update mangle_dupe_cols behavior in csv reader to match pandas 1.4.0 behavior (#10749) @karthikeyann
  • Fix cupy function in notebook (#10737) @ajschmidt8
  • Fix fillna to retain columns when it is MultiIndex (#10729) @galipremsagar
  • Fix scatter for all-empty-string column case (#10724) @davidwendt
  • Retain series name in Series.apply (#10716) @brandon-b-miller
  • Correct build dir cudf-config dependency issues for static builds (#10704) @robertmaynard
  • Fix list of testing requirements in setup.py. (#10678) @bdice
  • Fix rounding to zero error in stod on very small float numbers (#10672) @davidwendt
  • cuco isn't a cudf dependency when we are built shared (#10662) @robertmaynard
  • Fix to_timestamps to support Z for %z format specifier (#10617) @davidwendt
  • Verify compression type in Parquet reader (#10610) @vuule
  • Fix struct row comparator's exception on empty structs (#10604) @sperlingxx
  • Fix strings strip() to accept only str Scalar for to_strip parameter (#10597) @davidwendt
  • Fix has_atomic_support check in can_use_hash_groupby() (#10588) @jbrennan333
  • Revert Thrust 1.16 to Thrust 1.15 (#10586) @bdice
  • Fix missing RMM_STATIC_CUDART define when compiling JNI with static CUDA runtime (#10585) @jlowe
  • pin more cmake versions (#10570) @robertmaynard
  • Re-enable Build Metrics Report (#10562) @davidwendt
  • Remove statically linked CUDA runtime check in Java build (#10532) @jlowe
  • Fix temp data cleanup in test_text.py (#10524) @brandon-b-miller
  • Update pre-commit to run black 22.3.0 (#10523) @vyasr
  • Remove deprecated decimal_cols_as_float in the ORC reader (#10515) @vuule
  • Fix findall_record to return empty list for no matches (#10491) @davidwendt
  • Allow users to specify data types for a subset of columns in read_csv (#10484) @vuule
  • Fix default value of str.split expand parameter. (#10457) @bdice
  • Improve coverage of dask-cudf's groupby aggregation, add tests for dropna support (#10449) @charlesbluca
  • Allow string aggs for dask_cudf.CudfDataFrameGroupBy.aggregate (#10222) @charlesbluca
  • In-place updates with loc or iloc don't work correctly when the LHS has more than one column (#9918) @skirui-source

📖 Documentation

  • Clarify append deprecation notice. (#10930) @bdice
  • Use full name of GPUDirect Storage SDK in docs (#10904) @vuule
  • Update Dask + Pandas to Dask + cuDF path (#10897) @miguelusque
  • Add missing documentation in cudf/types.hpp (#10895) @karthikeyann
  • Add strong index iterator docs. (#10888) @bdice
  • spell check fixes (#10865) @karthikeyann
  • Add missing documentation in scalar/ headers (#10861) @karthikeyann
  • Remove typo in ngram documentation (#10859) @miguelusque
  • fix doxygen warnings (#10842) @karthikeyann
  • Add a library_design.md file documenting the core Python data structures and their relationship (#10817) @vyasr
  • Add NumPy to intersphinx references. (#10809) @bdice
  • Add a section to the docs that compares cuDF with Pandas (#10796) @shwina
  • Mention 2 cpp-reviewer requirement in pull request template (#10768) @davidwendt
  • Enable pydocstyle for all packages. (#10759) @bdice
  • Enable pydocstyle rules involving quotes (#10748) @vyasr
  • Revise 10 minutes notebook. (#10738) @bdice
  • Reorganize cuDF Python docs (#10691) @shwina
  • Fix sphinx/jupyter heading issue in UDF notebook (#10690) @brandon-b-miller
  • Migrated user guide notebooks to MyST-NB and added sphinx extension (#10685) @mmccarty
  • add data generation to benchmark documentation (#10677) @karthikeyann
  • Fix some docs build warnings (#10674) @galipremsagar
  • Update UDF notebook in User Guide. (#10668) @bdice
  • Improve User Guide docs (#10663) @bdice
  • Fix some docstrings formatting (#10660) @galipremsagar
  • Remove implementation details from apply docstrings (#10651) @brandon-b-miller
  • Revise CONTRIBUTING.md (#10644) @bdice
  • Add missing APIs to documentation. (#10643) @bdice
  • Use cudf.read_json as documented API name. (#10640) @bdice
  • Fix docstring section headings. (#10639) @bdice
  • Document cudf.read_text and cudf.read_avro. (#10638) @bdice
  • Fix type-o in docstring for json_reader_options (#10627) @dagardner-nv
  • Update guide to UDFs with notes about Series.applymap deprecation and related changes (#10607) @brandon-b-miller
  • Fix doxygen Modules page for cudf::lists::sequences (#10561) @davidwendt
  • Add Replace Backreferences section to Regex Features page (#10560) @davidwendt
  • Introduce deprecation policy to developer guide. (#10252) @vyasr

🚀 New Features

  • Enable Zstandard decompression only when all nvcomp integrations are enabled (#10944) @vuule
  • Handle nested types in cudf::concatenate_rows() (#10890) @nvdbaranec
  • Strong index types for equality comparator (#10883) @ttnghia
  • Add parameters to control page size in Parquet writer (#10882) @etseidl
  • Support for Zstandard decompression in ORC reader (#10873) @vuule
  • Use pre-built nvcomp 2.3 binaries by default (#10851) @robertmaynard
  • Support for Zstandard decompression in Parquet reader (#10847) @vuule
  • Add JNI support for apply_boolean_mask (#10812) @res-life
  • Segmented Min/Max for Fixed Point Types (#10794) @isVoid
  • Return per-file metadata from readers (#10782) @vuule
  • Segmented apply_boolean_mask for LIST columns (#10773) @mythrocks
  • Update groupby::hash to use new row operators for keys (#10770) @PointKernel
  • Support purging non-empty null elements from LIST/STRING columns (#10701) @mythrocks
  • Add detail::hash_join (#10695) @PointKernel
  • Persist string statistics data across multiple calls to orc chunked write (#10694) @hyperbolic2346
  • Add .list.astype() to cast list leaves to specified dtype (#10693) @shwina
  • JNI: Add generateListOffsets API (#10683) @sperlingxx
  • Support args in groupby apply (#10682) @brandon-b-miller
  • Enable segmented_gather in Java package (#10669) @sperlingxx
  • Add row hasher with nested column support (#10641) @devavret
  • Add support for numeric_only in DataFrame._reduce (#10629) @martinfalisse
  • First step toward statistics in ORC files with chunked writes (#10567) @hyperbolic2346
  • Add support for struct columns to the random table generator (#10566) @vuule
  • Enable passing a sequence for the index argument to .list.get() (#10564) @shwina
  • Add python bindings for cudf::list::index_of (#10549) @ChrisJar
  • Add default= kwarg to .list.get() accessor method (#10547) @shwina
  • Add cudf.DataFrame.applymap (#10542) @brandon-b-miller
  • Support nvComp 2.3 if local, otherwise use nvcomp 2.2 (#10513) @robertmaynard
  • Add column field ID control in parquet writer (#10504) @PointKernel
  • Deprecate Series.applymap (#10497) @brandon-b-miller
  • Add option to drop cache in cuIO benchmarks (#10488) @vuule
  • move benchmark input generation in device in reduction nvbench (#10486) @karthikeyann
  • Support Segmented Min/Max Reduction on String Type (#10447) @isVoid
  • List element Equality comparator (#10289) @devavret
  • Implement all methods of groupby rank aggregation in libcudf, python (#9569) @karthikeyann
  • Implement DataFrame.eval using libcudf ASTs (#8022) @vyasr

🛠️ Improvements

  • Use conda compilers in env file (#10915) @galipremsagar
  • Remove C style artifacts in cuIO (#10886) @vuule
  • Rename sliced_child to get_sliced_child. (#10885) @bdice
  • Replace defaulted stream value for libcudf APIs that use NVCOMP (#10877) @jbrennan333
  • Add more unit tests for cudf::distinct for nested types with sliced input (#10860) @ttnghia
  • Changing list_view.cuh to list_view.hpp (#10854) @ttnghia
  • More error checking in from_dlpack (#10850) @wence-
  • Cleanup regex compiler fixed quantifiers source (#10843) @davidwendt
  • Adds the JNI call for Cuda.deviceSynchronize (#10839) @abellina
  • Add missing cuda-python dependency to cudf (#10833) @bdice
  • Change std::string parameters in cudf::strings APIs to std::string_view (#10832) @davidwendt
  • Split up search.cu to improve compile time (#10831) @davidwendt
  • Add tests for null scalar binaryops (#10828) @brandon-b-miller
  • Cleanup regex compile optimize functions (#10825) @davidwendt
  • Use ThreadedMotoServer instead of subprocess in spinning up s3 server (#10822) @galipremsagar
  • Import NA from missing rather than using cudf.NA everywhere (#10821) @brandon-b-miller
  • Refactor regex builtin character-class identifiers (#10814) @davidwendt
  • Change pattern parameter for regex APIs from std::string to std::string_view (#10810) @davidwendt
  • Make the JNI API to get list offsets as a view public. (#10807) @revans2
  • Add cudf JNI docker build github action (#10806) @pxLi
  • Removed mr parameter from inplace bitmask operations (#10805) @AtlantaPepsi
  • Refactor cudf::contains, renaming and switching parameters role (#10802) @ttnghia
  • Handle closed property in IntervalDtype.from_pandas (#10798) @wence-
  • Return weak orderings from device_row_comparator. (#10793) @rwlee
  • Rework Scalar imports (#10791) @brandon-b-miller
  • Enable ccache for cudfjni build in Docker (#10790) @gerashegalov
  • Generic serialization of all column types (#10784) @wence-
  • simplifying skiprows test in test_orc.py (#10783) @hyperbolic2346
  • Use column_views instead of column_device_views in binary operations. (#10780) @bdice
  • Add struct utility functions. (#10776) @bdice
  • Add multiple rows to subword tokenizer benchmark (#10767) @davidwendt
  • Refactor host decompression in ORC reader (#10764) @vuule
  • Flush output streams before creating a process to drop caches (#10762) @vuule
  • Refactor binaryop/compiled/util.cpp (#10756) @bdice
  • Use warp per string for long strings in cudf::strings::contains() (#10739) @davidwendt
  • Use generator expressions in any/all functions. (#10736) @bdice
  • Use canonical "magic methods" (replace x.__repr__() with repr(x)). (#10735) @bdice
  • Improve use of isinstance. (#10734) @bdice
  • Rename tests from multiIndex to multiindex. (#10732) @bdice
  • Two-table comparators with strong index types (#10730) @bdice
  • Replace std::make_pair with std::pair (C++17 CTAD) (#10727) @karthikeyann
  • Use structured bindings instead of std::tie (#10726) @karthikeyann
  • Missing f prefix on f-strings fix (#10721) @code-review-doctor
  • Add max_file_size parameter to chunked parquet dataset writer (#10718) @galipremsagar
  • Deprecate merge_sorted, change dask cudf usage to internal method (#10713) @isVoid
  • Prepare dask_cudf test_parquet.py for upcoming API changes (#10709) @rjzamora
  • Remove or simplify various utility functions (#10705) @vyasr
  • Allow building arrow with parquet and not python (#10702) @revans2
  • Partial cuIO GPU decompression refactor (#10699) @vuule
  • Cython API refactor: merge.pyx (#10698) @isVoid
  • Fix random string data length to become variable (#10697) @galipremsagar
  • Add bindings for index_of with column search key (#10696) @ChrisJar
  • Deprecate index merging (#10689) @vyasr
  • Remove cudf::strings::string namespace (#10684) @davidwendt
  • Standardize imports. (#10680) @bdice
  • Standardize usage of collections.abc. (#10679) @bdice
  • Cython API Refactor: transpose.pyx, sort.pyx (#10675) @isVoid
  • Add device_memory_resource parameter to create_string_vector_from_column (#10673) @davidwendt
  • Split up mixed-join kernels source files (#10671) @davidwendt
  • Use std::filesystem for temporary directory location and deletion (#10664) @vuule
  • cleanup benchmark includes (#10661) @karthikeyann
  • Use upstream clang-format pre-commit hook. (#10659) @bdice
  • Clean up C++ includes to use <> instead of "". (#10658) @bdice
  • Handle RuntimeError thrown by CUDA Python in validate_setup (#10653) @shwina
  • Rework JNI CMake to leverage rapids_find_package (#10649) @jlowe
  • Use conda to build python packages during GPU tests (#10648) @Ethyling
  • Deprecate various functions that don't need to be defined for Index. (#10647) @vyasr
  • Update pinning to allow newer CMake versions. (#10646) @vyasr
  • Bump hadoop-common from 3.1.4 to 3.2.3 in /java (#10645) @dependabot[bot]
  • Remove concurrent_unordered_multimap. (#10642) @bdice
  • Improve parquet dictionary encoding (#10635) @PointKernel
  • Improve cudf::cuda_error (#10630) @sperlingxx
  • Add support for null and non-numeric types in Series.diff and DataFrame.diff (#10625) @Matt711
  • Branch 22.06 merge 22.04 (#10624) @vyasr
  • Unpin dask & distributed for development (#10623) @galipremsagar
  • Slightly improve accuracy of stod in to_floats (#10622) @davidwendt
  • Allow libcudfjni to be built as a static library (#10619) @jlowe
  • Change stack-based regex state data to use global memory (#10600) @davidwendt
  • Resolve Forward merging of branch-22.04 into branch-22.06 (#10598) @galipremsagar
  • KvikIO as an alternative GDS backend (#10593) @madsbk
  • Rename CUDA_TRY macro to CUDF_CUDA_TRY, rename CHECK_CUDA macro to CUDF_CHECK_CUDA. (#10589) @bdice
  • Upgrade cudf to support pandas 1.4.x versions (#10584) @galipremsagar
  • Refactor binary ops for timedelta and datetime columns (#10581) @vyasr
  • Refactor cudf::strings::count_re API to use count_matches utility (#10580) @davidwendt
  • Update Programming Language :: Python Versions to 3.8 & 3.9 (#10579) @madsbk
  • Automate Java cudf jar build with statically linked dependencies (#10578) @gerashegalov
  • Add patch for thrust-cub 1.16 to fix sort compile times (#10577) @davidwendt
  • Move binop methods from Frame to IndexedFrame and standardize the docstring (#10576) @vyasr
  • Cleanup libcudf strings regex classes (#10573) @davidwendt
  • Simplify preprocessing of arguments for DataFrame binops (#10563) @vyasr
  • Reduce kernel calls to build strings findall results (#10559) @davidwendt
  • Forward-merge branch-22.04 to branch-22.06 (#10557) @bdice
  • Update strings contains benchmark to measure varying match rates (#10555) @davidwendt
  • JNI: throw CUDA errors more specifically (#10551) @sperlingxx
  • Enable building static libs (#10545) @trxcllnt
  • Remove pip requirements files. (#10543) @bdice
  • Remove Click pinnings that are unnecessary after upgrading black. (#10541) @vyasr
  • Refactor memory_usage to improve performance (#10537) @galipremsagar
  • Adjust the valid range of group index for replace_with_backrefs (#10530) @sperlingxx
  • add accidentally removed comment. (#10526) @vyasr
  • Update conda environment. (#10525) @vyasr
  • Remove ColumnBase.__getitem__ (#10516) @vyasr
  • Optimize left_semi_join by materializing the gather mask (#10511) @cheinger
  • Define proper binary operation APIs for columns (#10509) @vyasr
  • Upgrade arrow-cpp & pyarrow to 7.0.0 (#10503) @galipremsagar
  • Update to Thrust 1.16 (#10489) @bdice
  • Namespace/Docstring Fixes for Reduction (#10471) @isVoid
  • Update cudfjni 22.06.0-SNAPSHOT (#10467) @pxLi
  • Use Lists of Columns for Various Files (#10463) @isVoid
  • Additional refactoring of hash functions (#10462) @bdice
  • Fix Series.str.findall behavior for expand=False. (#10459) @bdice
  • Remove deprecated code. (#10450) @vyasr
  • Update cmake-format version. (#10440) @vyasr
  • Consolidate C++ conda recipes and add libcudf-tests package (#10326) @ajschmidt8
  • Use conda compilers (#10275) @Ethyling
  • Add row bitmask as a detail::hash_join member (#10248) @PointKernel

- C++
Published by GPUtester over 3 years ago

https://github.com/rapidsai/cudf - v22.04.00

🚨 Breaking Changes

  • Drop unsupported method argument from nunique and distinct_count. (#10411) @bdice
  • Refactor stream compaction APIs (#10370) @PointKernel
  • Add scan_aggregation and reduce_aggregation derived types. (#10357) @nvdbaranec
  • Avoid decimal type narrowing for decimal binops (#10299) @galipremsagar
  • Rewrites sample API (#10262) @isVoid
  • Remove probe-time null equality parameters in cudf::hash_join (#10260) @PointKernel
  • Enable proper Index round-tripping in orc reader and writer (#10170) @galipremsagar
  • Add JNI for strings::split_re and strings::split_record_re (#10139) @ttnghia
  • Change cudf::strings::find_multiple to return a lists column (#10134) @davidwendt
  • Remove the option to completely disable decimal128 columns in the ORC reader (#10127) @vuule
  • Remove deprecated code (#10124) @vyasr
  • Update gpu_utils.py to reflect current CUDA support. (#10113) @bdice
  • Optimize compaction operations (#10030) @PointKernel
  • Remove deprecated method Series.set_index. (#9945) @bdice
  • Add cudf::strings::findall_record API (#9911) @davidwendt
  • Upgrade arrow & pyarrow to 6.0.1 (#9686) @galipremsagar

🐛 Bug Fixes

  • Fix an issue with tdigest merge aggregations. (#10506) @nvdbaranec
  • Batch of fixes for index overflows in grid stride loops. (#10448) @nvdbaranec
  • Update dask_cudf imports to be compatible with latest dask (#10442) @rlratzel
  • Fix for integer overflow in contiguous-split (#10437) @jbrennan333
  • Fix has_null predicate for drop_list_duplicates on nested structs (#10436) @sperlingxx
  • Fix empty reduce with List output and non-List input (#10435) @sperlingxx
  • Fix list and struct meta generation issue in dask-cudf (#10434) @galipremsagar
  • Fix error in cudf.to_numeric when a bool input is passed (#10431) @galipremsagar
  • Support cupy array in quantile input (#10429) @galipremsagar
  • Fix benchmarks to work with new aggregation types (#10428) @davidwendt
  • Fix cudf::shift to handle offset greater than column size (#10414) @davidwendt
  • Fix lifespan of the temporary directory that holds cuFile configuration file (#10403) @vuule
  • Fix error thrown in compiled-binaryop benchmark (#10398) @davidwendt
  • Limiting async allocator using alignment of 512 (#10395) @rongou
  • Include <optional> in multibyte split. (#10385) @bdice
  • Fix issue with column and scalar re-assignment (#10377) @galipremsagar
  • Fix floating point data generation in benchmarks (#10372) @vuule
  • Avoid overflow in fused_concatenate_kernel output_index (#10344) @abellina
  • Remove is_relationally_comparable for table device views (#10342) @davidwendt
  • Fix debug compile error in device_span to column_view conversion (#10331) @davidwendt
  • Add Pascal support to JCUDF transcode (row_conversion) (#10329) @mythrocks
  • Fix std::bad_alloc exception due to JIT reserving a huge buffer (#10317) @ttnghia
  • Fixes up the overflowed fixed-point round on nullable column (#10316) @sperlingxx
  • Fix DataFrame slicing issues for empty cases (#10310) @brandon-b-miller
  • Fix documentation issues (#10307) @ajschmidt8
  • Allow Java bindings to use default decimal precisions when writing columns (#10276) @sperlingxx
  • Fix incorrect slicing of GDS read/write calls (#10274) @vuule
  • Fix out-of-memory error in compiled-binaryop benchmark (#10269) @davidwendt
  • Add tests of reflected ufuncs and fix behavior of logical reflected ufuncs (#10261) @vyasr
  • Remove probe-time null equality parameters in cudf::hash_join (#10260) @PointKernel
  • Fix out-of-memory error in UrlDecode benchmark (#10258) @davidwendt
  • Fix groupby reductions that perform operations on source type instead of target type (#10250) @ttnghia
  • Fix small leak in explode (#10245) @revans2
  • Yet another small JNI memory leak (#10238) @revans2
  • Fix regex octal parsing to limit to 3 characters (#10233) @davidwendt
  • Fix string to decimal128 conversion handling large exponents (#10231) @davidwendt
  • Fix JNI leak on copy to device (#10229) @revans2
  • Fix the data generator element size for decimal types (#10225) @vuule
  • Fix decimal metadata in parquet writer (#10224) @galipremsagar
  • Fix strings handling of hex in regex pattern (#10220) @davidwendt
  • Fix docs builds (#10216) @ajschmidt8
  • Fix a leftover hasnulls change from Nullate (#10211) @devavret
  • Fix bitmask of the output for JNI of lists::drop_list_duplicates (#10210) @ttnghia
  • Fix compile error in binaryop/compiled/util.cpp (#10209) @ttnghia
  • Skip ORC and Parquet readers' benchmark cases that are not currently supported (#10194) @vuule
  • Fix JNI leak of a cudf::column_view native class. (#10171) @revans2
  • Enable proper Index round-tripping in orc reader and writer (#10170) @galipremsagar
  • Convert Column Name to String Before Using Struct Column Factory (#10156) @isVoid
  • Preserve the correct ListDtype while creating an identical empty column (#10151) @galipremsagar
  • benchmark fixture - static object pointer fix (#10145) @karthikeyann
  • Fix UDF Caching (#10133) @brandon-b-miller
  • Raise duplicate column error in DataFrame.rename (#10120) @galipremsagar
  • Fix flaky memory usage test by guaranteeing array size. (#10114) @vyasr
  • Encode values from python callback for C++ (#10103) @jdye64
  • Add check for regex instructions causing an infinite-loop (#10095) @davidwendt
  • Remove metadata singleton from nvtext normalizer (#10090) @davidwendt
  • Column equality testing fixes (#10011) @brandon-b-miller
  • Pin libcudf runtime dependency for cudf / libcudf-kafka nightlies (#9847) @charlesbluca

📖 Documentation

  • Fix documentation for DataFrame.corr and Series.corr. (#10493) @bdice
  • Add cut to API docs (#10479) @shwina
  • Remove documentation for methods removed in #10124. (#10366) @bdice
  • Fix documentation issues (#10306) @ajschmidt8
  • Fix fixed_point binary operation documentation (#10198) @codereport
  • Remove cleaned up methods from docs (#10189) @galipremsagar
  • Update developer guide to recommend no default stream parameter. (#10136) @bdice
  • Update benchmarking guide to use NVBench. (#10093) @bdice

🚀 New Features

  • Add StringIO support to read_text (#10465) @cwharris
  • Add support for tdigest and merge_tdigest aggregations through cudf::reduce (#10433) @nvdbaranec
  • JNI support for Collect Ops in Reduction (#10427) @sperlingxx
  • Enable read_text with dask_cudf using byte_range (#10407) @ChrisJar
  • Add cudf::stable_sort_by_key (#10387) @PointKernel
  • Implement maps_column_view abstraction over LIST<STRUCT<K,V>> (#10380) @mythrocks
  • Support Java bindings for Avro reader (#10373) @HaoYang670
  • Refactor stream compaction APIs (#10370) @PointKernel
  • Support collect aggregations in reduction (#10353) @sperlingxx
  • Refactor __array_ufunc__ for Index and unify across all classes (#10346) @vyasr
  • Add JNI for extract_list_element with index column (#10341) @firestarman
  • Support min and max operations for structs in rolling window (#10332) @ttnghia
  • Add device create_sequence_table for benchmarks (#10300) @karthikeyann
  • Enable numpy ufuncs for DataFrame (#10287) @vyasr
  • move input generation for json benchmark to device (#10281) @karthikeyann
  • move input generation for type dispatcher benchmark to device (#10280) @karthikeyann
  • move input generation for copy benchmark to device (#10279) @karthikeyann
  • generate url decode benchmark input in device (#10278) @karthikeyann
  • device input generation in join bench (#10277) @karthikeyann
  • Add nvtext::byte_pair_encoding API (#10270) @davidwendt
  • Prevent internal usage of expensive APIs (#10263) @vyasr
  • Column to JCUDF row for tables with strings (#10235) @hyperbolic2346
  • Support percent_rank() aggregation (#10227) @mythrocks
  • Refactor Series.__array_ufunc__ (#10217) @vyasr
  • Reduce pytest runtime (#10203) @brandon-b-miller
  • Add regex flags parameter to python cudf strings split (#10185) @davidwendt
  • Support for MOD, PMOD and PYMOD for decimal32/64/128 (#10179) @codereport
  • Adding string row size iterator for row to column and column to row conversion (#10157) @hyperbolic2346
  • Add file size counter to cuIO benchmarks (#10154) @vuule
  • byte_range support for multibyte_split/read_text (#10150) @cwharris
  • Add JNI for strings::split_re and strings::split_record_re (#10139) @ttnghia
  • Add maxSplit parameter to Java binding for strings:split (#10137) @ttnghia
  • Add libcudf strings split API that accepts regex pattern (#10128) @davidwendt
  • generate benchmark input in device (#10109) @karthikeyann
  • Avoid nan_as_null op if nan_count is 0 (#10082) @galipremsagar
  • Add Dataframe and Index nunique (#10077) @martinfalisse
  • Support nanosecond timestamps in parquet (#10063) @PointKernel
  • Java bindings for mixed semi and anti joins (#10040) @jlowe
  • Implement mixed equality/conditional semi/anti joins (#10037) @vyasr
  • Optimize compaction operations (#10030) @PointKernel
  • Support args= in Series.apply (#9982) @brandon-b-miller
  • Add cudf::strings::findall_record API (#9911) @davidwendt
  • Add covariance for sort groupby (python) (#9889) @mayankanand007
  • Implement DataFrame diff() (#9817) @skirui-source
  • Implement DataFrame pct_change (#9805) @skirui-source
  • Support segmented reductions and null mask reductions (#9621) @isVoid
  • Add 'spearman' correlation method for dataframe.corr and series.corr (#7141) @dominicshanshan

🛠️ Improvements

  • Add scipy skip for a test (#10502) @galipremsagar
  • Temporarily disable new ops-bot functionality (#10496) @ajschmidt8
  • Include <cstddef> to fix compilation of parquet reader on GCC 11. (#10483) @bdice
  • Pin dask and distributed (#10481) @galipremsagar
  • MD5 refactoring. (#10445) @bdice
  • Remove or split up Frame methods that use the index (#10439) @vyasr
  • Centralization of tdigest aggregation code. (#10422) @nvdbaranec
  • Simplify column binary operations (#10421) @vyasr
  • Add .github/ops-bot.yaml config file (#10420) @ajschmidt8
  • Use list of columns for methods in Groupby.pyx (#10419) @isVoid
  • Remove warnings in test_timedelta.py (#10418) @galipremsagar
  • Fix some warnings in test_parquet.py (#10416) @galipremsagar
  • JNI support for segmented reduce (#10413) @revans2
  • Clean up null mask after purging null entries (#10412) @sperlingxx
  • Drop unsupported method argument from nunique and distinct_count. (#10411) @bdice
  • Use str instead of builtins.str. (#10410) @bdice
  • Fix warnings in test_rolling (#10405) @bdice
  • Enable codecov github-check in CI (#10404) @galipremsagar
  • Fix warnings in test_cuda_apply, test_numerical, test_pickling, test_unaops. (#10402) @bdice
  • Set column names in _from_columns_like_self factory (#10400) @isVoid
  • Refactor nvtx annotations in cudf & dask-cudf (#10396) @galipremsagar
  • Consolidate .cov and .corr for sort groupby (#10386) @skirui-source
  • Consolidate some Frame APIs (#10381) @vyasr
  • Refactor hash functions and hash_combine (#10379) @bdice
  • Add nvtx annotations for Series and Index (#10374) @galipremsagar
  • Refactor filling.repeat API (#10371) @isVoid
  • Move standalone UTF8 functions from string_view.hpp to utf8.hpp (#10369) @davidwendt
  • Remove doc for deprecated function one_hot_encoding (#10367) @isVoid
  • Refactor array function (#10364) @vyasr
  • Fix warnings in test_csv.py. (#10362) @bdice
  • Implement a mixin for binops (#10360) @vyasr
  • Refactor cython interface: copying.pyx (#10359) @isVoid
  • Implement a mixin for scans (#10358) @vyasr
  • Add scan_aggregation and reduce_aggregation derived types. (#10357) @nvdbaranec
  • Add cleanup of python artifacts (#10355) @galipremsagar
  • Fix warnings in test_categorical.py. (#10354) @bdice
  • Create a dispatcher for invoking regex kernel functions (#10349) @davidwendt
  • Fix codecov in CI (#10347) @galipremsagar
  • Enable caching for memory_usage calculation in Column (#10345) @galipremsagar
  • C++17 cleanup: traits replace std::enable_if<>::type with std::enable_if_t (#10343) @karthikeyann
  • JNI: Support appending DECIMAL128 into ColumnBuilder in terms of byte array (#10338) @sperlingxx
  • multibyte_split test improvements (#10328) @vuule
  • Fix warnings in test_binops.py. (#10327) @bdice
  • Fix warnings from pandas in test_array_ufunc.py. (#10324) @bdice
  • Update upload script (#10321) @ajschmidt8
  • Move hash type declarations to hashing.hpp (#10320) @davidwendt
  • C++17 cleanup: traits replace ::value with _v (#10319) @karthikeyann
  • Remove internal columns usage (#10315) @vyasr
  • Remove extraneous build.sh parameter (#10313) @ajschmidt8
  • Add const qualifier to MurmurHash3_32::hash_combine (#10311) @davidwendt
  • Remove TODO in libcudf_kafka recipe (#10309) @ajschmidt8
  • Add conversions between column_view and device_span<T const>. (#10302) @bdice
  • Avoid decimal type narrowing for decimal binops (#10299) @galipremsagar
  • Deprecate DataFrame.iteritems and introduce .items (#10298) @galipremsagar
  • Explicitly request CMake use gnu++17 over c++17 (#10297) @robertmaynard
  • Add copyright check as pre-commit hook. (#10290) @vyasr
  • DataFrame insert and creation optimizations (#10285) @galipremsagar
  • Improve hash join detail functions (#10273) @PointKernel
  • Replace custom cached_property implementation with functools (#10272) @shwina
  • Rewrites sample API (#10262) @isVoid
  • Bump hadoop-common from 3.1.0 to 3.1.4 in /java (#10259) @dependabot[bot]
  • Remove making redundant copy across code-base (#10257) @galipremsagar
  • Add more nvtx annotations (#10256) @galipremsagar
  • Add copyright check in cudf (#10253) @galipremsagar
  • Remove redundant copies in fillna to improve performance (#10241) @galipremsagar
  • Remove std::numeric_limit specializations for timestamp & durations (#10239) @codereport
  • Optimize DataFrame creation across code-base (#10236) @galipremsagar
  • Change pytest distribution algorithm and increase parallelism in CI (#10232) @galipremsagar
  • Add environment variables for I/O thread pool and slice sizes (#10218) @vuule
  • Add regex flags to strings findall functions (#10208) @davidwendt
  • Update dask-cudf parquet tests to reflect upstream bugfixes to _metadata (#10206) @charlesbluca
  • Remove unnecessary nunique function in Series. (#10205) @martinfalisse
  • Refactor DataFrame tests. (#10204) @bdice
  • Rewrites column.__setitem__, Use boolean_mask_scatter (#10202) @isVoid
  • Java utilities to aid in accelerating aggregations on 128-bit types (#10201) @jlowe
  • Fix docstrings alignment in Frame methods (#10199) @galipremsagar
  • Fix cuco pair issue in hash join (#10195) @PointKernel
  • Replace dask groupby .index usages with .by (#10193) @galipremsagar
  • Add regex flags to strings extract function (#10192) @davidwendt
  • Forward-merge branch-22.02 to branch-22.04 (#10191) @bdice
  • Add CMake install rule for tests (#10190) @ajschmidt8
  • Unpin dask & distributed (#10182) @galipremsagar
  • Add comments to explain test validation (#10176) @galipremsagar
  • Reduce warnings in pytest output (#10168) @bdice
  • Some consolidation of indexed frame methods (#10167) @vyasr
  • Refactor isin implementations (#10165) @vyasr
  • Faster struct row comparator (#10164) @devavret
  • Refactor groupby::get_groups. (#10161) @bdice
  • Deprecate decimal_cols_as_float in ORC reader (C++ layer) (#10152) @vuule
  • Replace ccache with sccache (#10146) @ajschmidt8
  • Murmur3 hash kernel cleanup (#10143) @rwlee
  • Deprecate decimal_cols_as_float in ORC reader (#10142) @galipremsagar
  • Run pyupgrade 2.31.0. (#10141) @bdice
  • Remove drop_nan from internal IndexedFrame._drop_na_rows. (#10140) @bdice
  • Change cudf::strings::find_multiple to return a lists column (#10134) @davidwendt
  • Update cmake-format script for branch 22.04. (#10132) @bdice
  • Accept r-value references in convert_table_for_return(): (#10131) @mythrocks
  • Remove the option to completely disable decimal128 columns in the ORC reader (#10127) @vuule
  • Remove deprecated code (#10124) @vyasr
  • Update gpu_utils.py to reflect current CUDA support. (#10113) @bdice
  • Remove benchmarks suffix (#10112) @bdice
  • Update cudf java binding version to 22.04.0-SNAPSHOT (#10084) @pxLi
  • Remove unnecessary docker files. (#10069) @vyasr
  • Limit benchmark iterations using environment variable (#10060) @karthikeyann
  • Add timing chart for libcudf build metrics report page (#10038) @davidwendt
  • JNI: Rewrite growBuffersAndRows to accelerate the HostColumnBuilder (#10025) @sperlingxx
  • Reduce redundant code in CUDF JNI (#10019) @mythrocks
  • Make snappy decompress check more efficient (#9995) @cheinger
  • Remove deprecated method Series.set_index. (#9945) @bdice
  • Implement a mixin for reductions (#9925) @vyasr
  • JNI: Push back decimal utils from spark-rapids (#9907) @sperlingxx
  • Add assert_column_memory_* (#9882) @isVoid
  • Add CUDF_UNREACHABLE macro. (#9727) @bdice
  • Upgrade arrow & pyarrow to 6.0.1 (#9686) @galipremsagar

- C++
Published by GPUtester almost 4 years ago

https://github.com/rapidsai/cudf - v22.02.00

🚨 Breaking Changes

  • ORC writer API changes for granular statistics (#10058) @mythrocks
  • decimal128 Support for to/from_arrow (#9986) @codereport
  • Remove deprecated method one_hot_encoding (#9977) @isVoid
  • Remove str.subword_tokenize (#9968) @VibhuJawa
  • Remove deprecated method parameter from merge and join. (#9944) @bdice
  • Remove deprecated method DataFrame.hash_columns. (#9943) @bdice
  • Remove deprecated method Series.hash_encode. (#9942) @bdice
  • Refactoring ceil/round/floor code for datetime64 types (#9926) @mayankanand007
  • Introduce nan_as_null parameter for cudf.Index (#9893) @galipremsagar
  • Add regex_flags parameter to strings replace_re functions (#9878) @davidwendt
  • Break tie for top categorical columns in Series.describe (#9867) @isVoid
  • Add partitioning support in parquet writer (#9810) @devavret
  • Move drop_duplicates, drop_na, _gather, take to IndexFrame and create their _base_index counterparts (#9807) @isVoid
  • Raise temporary error for decimal128 types in parquet reader (#9804) @galipremsagar
  • Change default dtype of all nulls column from float to object (#9803) @galipremsagar
  • Remove unused masked udf cython/c++ code (#9792) @brandon-b-miller
  • Pick smallest decimal type with required precision in ORC reader (#9775) @vuule
  • Add decimal128 support to Parquet reader and writer (#9765) @vuule
  • Refactor TableTest assertion methods to a separate utility class (#9762) @jlowe
  • Use cuFile direct device reads/writes by default in cuIO (#9722) @vuule
  • Match pandas scalar result types in reductions (#9717) @brandon-b-miller
  • Add parameters to control row group size in Parquet writer (#9677) @vuule
  • Refactor bit counting APIs, introduce valid/null count functions, and split host/device side code for segmented counts. (#9588) @bdice
  • Add support for decimal128 in cudf python (#9533) @galipremsagar
  • Implement lists::index_of() to find positions in list rows (#9510) @mythrocks
  • Rewriting row/column conversions for Spark <-> cudf data conversions (#8444) @hyperbolic2346

🐛 Bug Fixes

  • Add check for negative stripe index in ORC reader (#10074) @vuule
  • Update Java tests to expect DECIMAL128 from Arrow (#10073) @jlowe
  • Avoid index materialization when DataFrame is created with un-named Series objects (#10071) @galipremsagar
  • fix gcc 11 compilation errors (#10067) @rongou
  • Fix columns ordering issue in parquet reader (#10066) @galipremsagar
  • Fix dataframe setitem with ndarray types (#10056) @galipremsagar
  • Remove implicit copy due to conversion from cudf::size_type and size_t (#10045) @robertmaynard
  • Include <optional> in headers that use std::optional (#10044) @robertmaynard
  • Fix repr and concat of StructColumn (#10042) @galipremsagar
  • Include row group level stats when writing ORC files (#10041) @vuule
  • build.sh respects the --build_metrics and --incl_cache_stats flags (#10035) @robertmaynard
  • Fix memory leaks in JNI native code. (#10029) @mythrocks
  • Update JNI to use new arena mr constructor (#10027) @rongou
  • Fix null check when comparing structs in arg_min operation of reduction/groupby (#10026) @ttnghia
  • Wrap CI script shell variables in quotes to fix local testing. (#10018) @bdice
  • cudftestutil no longer propagates compiler flags to external users (#10017) @robertmaynard
  • Remove CUDA_DEVICE_CALLABLE macro usage (#10015) @hyperbolic2346
  • Add missing list filling header in meta.yaml (#10007) @devavret
  • Fix conda recipes for custreamz & cudf_kafka (#10003) @ajschmidt8
  • Fix matching regex word-boundary (\b) in strings replace (#9997) @davidwendt
  • Fix null check when comparing structs in min and max reduction/groupby operations (#9994) @ttnghia
  • Fix octal pattern matching in regex string (#9993) @davidwendt
  • decimal128 Support for to/from_arrow (#9986) @codereport
  • Fix groupby shift/diff/fill after selecting from a GroupBy (#9984) @shwina
  • Fix the overflow problem of decimal rescale (#9966) @sperlingxx
  • Use default value for decimal precision in parquet writer when not specified (#9963) @devavret
  • Fix cudf java build error. (#9958) @firestarman
  • Use gpuci_mamba_retry to install local artifacts. (#9951) @bdice
  • Fix regression HostColumnVectorCore requiring native libs (#9948) @jlowe
  • Rename aggregate_metadata in writer to fix name collision (#9938) @devavret
  • Fixed issue with percentile_approx where output tdigests could have uninitialized data at the end. (#9931) @nvdbaranec
  • Resolve racecheck errors in ORC kernels (#9916) @vuule
  • Fix the java build after parquet partitioning support (#9908) @revans2
  • Fix compilation of benchmark for parquet writer. (#9905) @bdice
  • Fix a memcheck error in ORC writer (#9896) @vuule
  • Introduce nan_as_null parameter for cudf.Index (#9893) @galipremsagar
  • Fix fallback to sort aggregation for grouping only hash aggregate (#9891) @abellina
  • Add zlib to cudfjni link when using static libcudf library dependency (#9890) @jlowe
  • TimedeltaIndex constructor raises an AttributeError. (#9884) @skirui-source
  • Fix cudf.Scalar string datetime construction (#9875) @brandon-b-miller
  • Load libcufile.so with RTLD_NODELETE flag (#9872) @vuule
  • Break tie for top categorical columns in Series.describe (#9867) @isVoid
  • Fix null handling for structs min and arg_min in groupby, groupby scan, reduction, and inclusive_scan (#9864) @ttnghia
  • Add one-level list encoding support in parquet reader (#9848) @PointKernel
  • Fix an out-of-bounds read in validity copying in contiguous_split. (#9842) @nvdbaranec
  • Fix join of MultiIndex to Index with one column and overlapping name. (#9830) @vyasr
  • Fix caching in Series.applymap (#9821) @brandon-b-miller
  • Enforce boolean ascending for dask-cudf sort_values (#9814) @charlesbluca
  • Fix ORC writer crash with empty input columns (#9808) @vuule
  • Change default dtype of all nulls column from float to object (#9803) @galipremsagar
  • Load native dependencies when Java ColumnView is loaded (#9800) @jlowe
  • Fix dtype-argument bug in dask_cudf read_csv (#9796) @rjzamora
  • Fix overflow for min calculation in strings::from_timestamps (#9793) @revans2
  • Fix memory error due to lambda return type deduction limitation (#9778) @karthikeyann
  • Revert regex $/EOL end-of-string new-line special case handling (#9774) @davidwendt
  • Fix missing streams (#9767) @karthikeyann
  • Fix make_empty_scalar_like on list_type (#9759) @sperlingxx
  • Update cmake and conda to 22.02 (#9746) @devavret
  • Fix out-of-bounds memory write in decimal128-to-string conversion (#9740) @davidwendt
  • Match pandas scalar result types in reductions (#9717) @brandon-b-miller
  • Fix regex non-multiline EOL/$ matching strings ending with a new-line (#9715) @davidwendt
  • Fixed build by adding more checks for int8, int16 (#9707) @razajafri
  • Fix null handling when boolean dtype is passed (#9691) @galipremsagar
  • Fix stream usage in segmented_gather() (#9679) @mythrocks

📖 Documentation

  • Update decimal dtypes related docs entries (#10072) @galipremsagar
  • Fix regex doc describing hexadecimal escape characters (#10009) @davidwendt
  • Fix cudf compilation instructions. (#9956) @esoha-nvidia
  • Fix see also links for IO APIs (#9895) @galipremsagar
  • Fix build instructions for libcudf doxygen (#9837) @davidwendt
  • Fix some doxygen warnings and add missing documentation (#9770) @karthikeyann
  • update cuda version in local build (#9736) @karthikeyann
  • Fix doxygen for enum types in libcudf (#9724) @davidwendt
  • Spell check fixes (#9682) @karthikeyann
  • Fix links in C++ Developer Guide. (#9675) @bdice

🚀 New Features

  • Remove libcudacxx patch needed for nvcc 11.4 (#10057) @robertmaynard
  • Allow CuPy 10 (#10048) @jakirkham
  • Add in support for NULL_LOGICAL_AND and NULL_LOGICAL_OR binops (#10016) @revans2
  • Add groupby.transform (only support for aggregations) (#10005) @shwina
  • Add partitioning support to Parquet chunked writer (#10000) @devavret
  • Add jni for sequences (#9972) @wbo4958
  • Java bindings for mixed left, inner, and full joins (#9941) @jlowe
  • Java bindings for JSON reader support (#9940) @wbo4958
  • Enable transpose for string columns in cudf python (#9937) @galipremsagar
  • Support structs for cudf::contains with column/scalar input (#9929) @ttnghia
  • Implement mixed equality/conditional joins (#9917) @vyasr
  • Add cudf::strings::extract_all API (#9909) @davidwendt
  • Implement JNI for cudf::scatter APIs (#9903) @ttnghia
  • JNI: Function to copy and set validity from bool column. (#9901) @mythrocks
  • Add dictionary support to cudf::copy_if_else (#9887) @davidwendt
  • add run_benchmarks target for running benchmarks with json output (#9879) @karthikeyann
  • Add regex_flags parameter to strings replace_re functions (#9878) @davidwendt
  • add_suffix and add_prefix for DataFrames and Series (#9846) @mayankanand007
  • Add JNI for cudf::drop_duplicates (#9841) @ttnghia
  • Implement per-list sequence (#9839) @ttnghia
  • adding series.transpose (#9835) @mayankanand007
  • Adding support for Series.autocorr (#9833) @mayankanand007
  • Support round operation on datetime64 datatypes (#9820) @mayankanand007
  • Add partitioning support in parquet writer (#9810) @devavret
  • Raise temporary error for decimal128 types in parquet reader (#9804) @galipremsagar
  • Add decimal128 support to Parquet reader and writer (#9765) @vuule
  • Optimize groupby::scan (#9754) @PointKernel
  • Add sample JNI API (#9728) @res-life
  • Support min and max in inclusive scan for structs (#9725) @ttnghia
  • Add first and last method to IndexedFrame (#9710) @isVoid
  • Support min and max reduction for structs (#9697) @ttnghia
  • Add parameters to control row group size in Parquet writer (#9677) @vuule
  • Run compute-sanitizer in nightly build (#9641) @karthikeyann
  • Implement Series.datetime.floor (#9571) @skirui-source
  • ceil/floor for DatetimeIndex (#9554) @mayankanand007
  • Add support for decimal128 in cudf python (#9533) @galipremsagar
  • Implement lists::index_of() to find positions in list rows (#9510) @mythrocks
  • custreamz oauth callback for kafka (librdkafka) (#9486) @jdye64
  • Add Pearson correlation for sort groupby (python) (#9166) @skirui-source
  • Interchange dataframe protocol (#9071) @iskode
  • Rewriting row/column conversions for Spark <-> cudf data conversions (#8444) @hyperbolic2346

🛠️ Improvements

  • Prepare upload scripts for Python 3.7 removal (#10092) @Ethyling
  • Simplify custreamz and cudf_kafka recipes files (#10065) @Ethyling
  • ORC writer API changes for granular statistics (#10058) @mythrocks
  • Remove python constraints in custreamz and cudf_kafka recipes (#10052) @Ethyling
  • Unpin dask and distributed in CI (#10028) @galipremsagar
  • Add _from_column_like_self factory (#10022) @isVoid
  • Replace custom CUDA bindings previously provided by RMM with official CUDA Python bindings (#10008) @shwina
  • Use cuda::std::is_arithmetic in cudf::is_numeric trait. (#9996) @bdice
  • Clean up CUDA stream use in cuIO (#9991) @vuule
  • Use addressed-ordered first fit for the pinned memory pool (#9989) @rongou
  • Add strings tests to transpose_test.cpp (#9985) @davidwendt
  • Use gpuci_mamba_retry on Java CI. (#9983) @bdice
  • Remove deprecated method one_hot_encoding (#9977) @isVoid
  • Minor cleanup of unused Python functions (#9974) @vyasr
  • Use new efficient partitioned parquet writing in cuDF (#9971) @devavret
  • Remove str.subword_tokenize (#9968) @VibhuJawa
  • Forward-merge branch-21.12 to branch-22.02 (#9947) @bdice
  • Remove deprecated method parameter from merge and join. (#9944) @bdice
  • Remove deprecated method DataFrame.hash_columns. (#9943) @bdice
  • Remove deprecated method Series.hash_encode. (#9942) @bdice
  • use ninja in java ci build (#9933) @rongou
  • Add build-time publish step to cpu build script (#9927) @davidwendt
  • Refactoring ceil/round/floor code for datetime64 types (#9926) @mayankanand007
  • Remove various unused functions (#9922) @vyasr
  • Raise in query if dtype is not supported (#9921) @brandon-b-miller
  • Add missing imports tests (#9920) @Ethyling
  • Spark Decimal128 hashing (#9919) @rwlee
  • Replace thrust/std::get with structured bindings (#9915) @codereport
  • Upgrade thrust version to 1.15 (#9912) @robertmaynard
  • Remove conda envs for CUDA 11.0 and 11.2. (#9910) @bdice
  • Return count of set bits from inplace_bitmask_and. (#9904) @bdice
  • Use dynamic nullate for join hasher and equality comparator (#9902) @davidwendt
  • Update ucx-py version on release using rvc (#9897) @Ethyling
  • Remove IncludeCategories from .clang-format (#9876) @codereport
  • Support statically linking CUDA runtime for Java bindings (#9873) @jlowe
  • Add clang-tidy to libcudf (#9860) @codereport
  • Remove deprecated methods from Java Table class (#9853) @jlowe
  • Add test for map column metadata handling in ORC writer (#9852) @vuule
  • Use pandas to_offset to parse frequency string in date_range (#9843) @isVoid
  • add templated benchmark with fixture (#9838) @karthikeyann
  • Use list of column inputs for apply_boolean_mask (#9832) @isVoid
  • Added a few more tests for Decimal to String cast (#9818) @razajafri
  • Run doctests. (#9815) @bdice
  • Avoid overflow for fixed_point round (#9809) @sperlingxx
  • Move drop_duplicates, drop_na, _gather, take to IndexFrame and create their _base_index counterparts (#9807) @isVoid
  • Use vector factories for host-device copies. (#9806) @bdice
  • Refactor host device macros (#9797) @vyasr
  • Remove unused masked udf cython/c++ code (#9792) @brandon-b-miller
  • Allow custom sort functions for dask-cudf sort_values (#9789) @charlesbluca
  • Improve build time of libcudf iterator tests (#9788) @davidwendt
  • Copy Java native dependencies directly into classpath (#9787) @jlowe
  • Add decimal types to cuIO benchmarks (#9776) @vuule
  • Pick smallest decimal type with required precision in ORC reader (#9775) @vuule
  • Avoid overflow for fixed_point cudf::cast and performance optimization (#9772) @codereport
  • Use CTAD with Thrust function objects (#9768) @codereport
  • Refactor TableTest assertion methods to a separate utility class (#9762) @jlowe
  • Use Java classloader to find test resources (#9760) @jlowe
  • Allow cast decimal128 to string and add tests (#9756) @razajafri
  • Load balance optimization for contiguous_split (#9755) @nvdbaranec
  • Consolidate and improve reset_index (#9750) @isVoid
  • Update to UCX-Py 0.24 (#9748) @pentschev
  • Skip cufile tests in JNI build script (#9744) @pxLi
  • Enable string to decimal 128 cast (#9742) @razajafri
  • Use stop instead of stop_. (#9735) @bdice
  • Forward-merge branch-21.12 to branch-22.02 (#9730) @bdice
  • Improve cmake format script (#9723) @vyasr
  • Use cuFile direct device reads/writes by default in cuIO (#9722) @vuule
  • Add directory-partitioned data support to cudf.read_parquet (#9720) @rjzamora
  • Use stream allocator adaptor for hash join table (#9704) @PointKernel
  • Update check for inf/nan strings in libcudf float conversion to ignore case (#9694) @davidwendt
  • Update cudf JNI to 22.02.0-SNAPSHOT (#9681) @pxLi
  • Replace cudf's concurrent_ordered_map with cuco::static_map in semi/anti joins (#9666) @vyasr
  • Some improvements to parse_decimal function and bindings for is_fixed_point (#9658) @razajafri
  • Add utility to format ninja-log build times (#9631) @davidwendt
  • Allow runtime has_nulls parameter for row operators (#9623) @davidwendt
  • Use fsspec.parquet for improved read_parquet performance from remote storage (#9589) @rjzamora
  • Refactor bit counting APIs, introduce valid/null count functions, and split host/device side code for segmented counts. (#9588) @bdice
  • Use List of Columns as Input for drop_nulls, gather and drop_duplicates (#9558) @isVoid
  • Simplify merge internals and reduce overhead (#9516) @vyasr
  • Add struct generation support in datagenerator & fuzz tests (#9180) @galipremsagar
  • Simplify write_csv by removing unnecessary writer/impl classes (#9089) @cwharris

- C++
Published by GPUtester about 4 years ago

https://github.com/rapidsai/cudf - v21.12.02

v21.12.02

- C++
Published by GPUtester about 4 years ago

https://github.com/rapidsai/cudf - v21.12.01

v21.12.01

- C++
Published by GPUtester about 4 years ago

https://github.com/rapidsai/cudf - v21.12.00

🚨 Breaking Changes

  • Update bitmask_and and bitmask_or to return a pair of resulting mask and count of unset bits (#9616) @PointKernel
  • Remove sizeof and standardize on memory_usage (#9544) @vyasr
  • Add support for single-line regex anchors ^/$ in contains_re (#9482) @davidwendt
  • Refactor sorting APIs (#9464) @vyasr
  • Update Java nvcomp JNI bindings to nvcomp 2.x API (#9384) @jbrennan333
  • Support Python UDFs written in terms of rows (#9343) @brandon-b-miller
  • JNI: Support nested types in ORC writer (#9334) @firestarman
  • Optionally nullify out-of-bounds indices in segmented_gather(). (#9318) @mythrocks
  • Refactor cuIO timestamp processing with cuda::std::chrono (#9278) @PointKernel
  • Various internal MultiIndex improvements (#9243) @vyasr

πŸ› Bug Fixes

  • Fix read_parquet bug for bytes input (#9669) @rjzamora
  • Use _gather internal for sort_* (#9668) @isVoid
  • Fix behavior of equals for non-DataFrame Frames and add tests. (#9653) @vyasr
  • Don't recompute output size if it is already available (#9649) @abellina
  • Fix read_parquet bug for extended dtypes from remote storage (#9638) @rjzamora
  • add const when getting data from a JNI data wrapper (#9637) @wjxiz1992
  • Fix debrotli issue on CUDA 11.5 (#9632) @vuule
  • Use std::size_t when computing join output size (#9626) @jlowe
  • Fix usecols parameter handling in dask_cudf.read_csv (#9618) @galipremsagar
  • Add support for string 'nan', 'inf' & '-inf' values while type-casting to float (#9613) @galipremsagar
  • Avoid passing NativeFileDatasource to pyarrow in read_parquet (#9608) @rjzamora
  • Fix test failure with cuda 11.5 in rowbitcount tests. (#9581) @nvdbaranec
  • Correct _LIBCUDACXX_CUDACC_VER value computation (#9579) @robertmaynard
  • Increase max RLE stream size estimate to avoid potential overflows (#9568) @vuule
  • Fix edge case in tdigest scalar generation for groups containing all nulls. (#9551) @nvdbaranec
  • Fix pytests failing in cuda-11.5 environment (#9547) @galipremsagar
  • compile libnvcomp with PTDS if requested (#9540) @jbrennan333
  • Fix segmented_gather() for null LIST rows (#9537) @mythrocks
  • Deprecate DataFrame.labelencoding, use private _labelencoding method internally. (#9535) @bdice
  • Fix several test and benchmark issues related to bitmask allocations. (#9521) @nvdbaranec
  • Fix for inserting duplicates in groupby result cache (#9508) @karthikeyann
  • Fix mismatched types error in clip() when using non int64 numeric types (#9498) @davidwendt
  • Match conda pinnings for style checks (revert part of #9412, #9433). (#9490) @bdice
  • Make sure all dask-cudf supported aggs are handled in _tree_node_agg (#9487) @charlesbluca
  • Resolve hash_columns FutureWarning in dask_cudf (#9481) @pentschev
  • Add fixed point to AllTypes in libcudf unit tests (#9472) @karthikeyann
  • Fix regex handling of embedded null characters (#9470) @davidwendt
  • Fix memcheck error in copy-if-else (#9467) @davidwendt
  • Fix bug in dask_cudf.read_parquet for index=False (#9453) @rjzamora
  • Preserve the decimal scale when creating a default scalar (#9449) @revans2
  • Push down parent nulls when flattening nested columns. (#9443) @mythrocks
  • Fix memcheck error in gtest SegmentedGatherTest/GatherSliced (#9442) @davidwendt
  • Revert "Fix quantile division / partition handling for dask-cudf sort… (#9438) @charlesbluca
  • Allow int-like objects for the decimals argument in round (#9428) @shwina
  • Fix stream compaction's drop_duplicates API to use stable sort (#9417) @ttnghia
  • Skip Comparing Uniform Window Results in Var/std Tests (#9416) @isVoid
  • Fix StructColumn.to_pandas type handling issues (#9388) @galipremsagar
  • Correct issues in the build dir cudf-config.cmake (#9386) @robertmaynard
  • Fix Java table partition test to account for non-deterministic ordering (#9385) @jlowe
  • Fix timestamp truncation/overflow bugs in orc/parquet (#9382) @PointKernel
  • Fix the crash in stats code (#9368) @devavret
  • Make Series.hash_encode results reproducible. (#9366) @bdice
  • Fix libcudf compile warnings on debug 11.4 build (#9360) @davidwendt
  • Fail gracefully when compiling python UDFs that attempt to access columns with unsupported dtypes (#9359) @brandon-b-miller
  • Set pass_filenames: false in mypy pre-commit configuration. (#9349) @bdice
  • Fix cudf_assert in cudf::io::orc::gpu::gpuDecodeOrcColumnData (#9348) @davidwendt
  • Fix memcheck error in groupby-tdigest get_scalar_minmax (#9339) @davidwendt
  • Optimizations for cudf.concat when axis=1 (#9333) @galipremsagar
  • Use f-string in join helper warning message. (#9325) @bdice
  • Avoid casting to list or struct dtypes in dask_cudf.read_parquet (#9314) @rjzamora
  • Fix null count in statistics for parquet (#9303) @devavret
  • Potential overflow of decimal32 when casting to int64_t (#9287) @codereport
  • Fix quantile division / partition handling for dask-cudf sort on null dataframes (#9259) @charlesbluca
  • Updating cudf version also updates rapids cmake branch (#9249) @robertmaynard
  • Implement one_hot_encoding in libcudf and bind to python (#9229) @isVoid
  • BUG FIX: CSV Writer ignores the header parameter when no metadata is provided (#8740) @skirui-source

📖 Documentation

  • Update Documentation to use TYPED_TEST_SUITE (#9654) @codereport
  • Add dedicated page for StringHandling in python docs (#9624) @galipremsagar
  • Update docstring of DataFrame.merge (#9572) @galipremsagar
  • Use raw strings to avoid SyntaxErrors in parsed docstrings. (#9526) @bdice
  • Add example to docstrings in rolling.apply (#9522) @isVoid
  • Update help message to escape quotes in ./build.sh --cmake-args. (#9494) @bdice
  • Improve Python docstring formatting. (#9493) @bdice
  • Update table of I/O supported types (#9476) @vuule
  • Document invalid regex patterns as undefined behavior (#9473) @davidwendt
  • Miscellaneous documentation fixes to cudf (#9471) @galipremsagar
  • Fix many documentation errors in libcudf. (#9355) @karthikeyann
  • Fixing SubwordTokenizer docs issue (#9354) @mayankanand007
  • Improved deprecation warnings. (#9347) @bdice
  • doc reorder mr, stream to stream, mr (#9308) @karthikeyann
  • Deprecate method parameters to DataFrame.join, DataFrame.merge. (#9291) @bdice
  • Added deprecation warning for .label_encoding() (#9289) @mayankanand007

🚀 New Features

  • Enable Series.divide and DataFrame.divide (#9630) @vyasr
  • Update bitmask_and and bitmask_or to return a pair of resulting mask and count of unset bits (#9616) @PointKernel
  • Add handling of mixed numeric types in to_dlpack (#9585) @galipremsagar
  • Support re.Pattern object for pat arg in str.replace (#9573) @davidwendt
  • Add JNI for lists::drop_list_duplicates with keys-values input column (#9553) @ttnghia
  • Support structs column in min, max, argmin and argmax groupby aggregate() and scan() (#9545) @ttnghia
  • Move libcudacxx to use rapids_cpm and use newer versions (#9539) @robertmaynard
  • Add scan min/max support for chrono types to libcudf reduction-scan (not groupby scan) (#9518) @davidwendt
  • Support args= in apply (#9514) @brandon-b-miller
  • Add groupby scan min/max support for strings values (#9502) @davidwendt
  • Add list output option to character_ngrams() function (#9499) @davidwendt
  • More granular column selection in ORC reader (#9496) @vuule
  • add min_periods, ddof to groupby covariance, & correlation aggregation (#9492) @karthikeyann
  • Implement Series.datetime.floor (#9488) @skirui-source
  • Enable linting of CMake files using pre-commit (#9484) @vyasr
  • Add support for single-line regex anchors ^/$ in contains_re (#9482) @davidwendt
  • Augment order_by to Accept a List of null_precedence (#9455) @isVoid
  • Add format API for list column of strings (#9454) @davidwendt
  • Enable Datetime/Timedelta dtypes in Masked UDFs (#9451) @brandon-b-miller
  • Add cudf python groupby.diff (#9446) @karthikeyann
  • Implement lists::stable_sort_lists for stable sorting of elements within each row of lists column (#9425) @ttnghia
  • add ctest memcheck using cuda-sanitizer (#9414) @karthikeyann
  • Support Unary Operations in Masked UDF (#9409) @isVoid
  • Move Several Series Function to Frame (#9394) @isVoid
  • MD5 Python hash API (#9390) @bdice
  • Add cudf strings is_title API (#9380) @davidwendt
  • Enable casting to int64, uint64, and double in AST code. (#9379) @vyasr
  • Add support for writing ORC with map columns (#9369) @vuule
  • extract_list_elements() with column_view indices (#9367) @mythrocks
  • Reimplement lists::drop_list_duplicates for keys-values lists columns (#9345) @ttnghia
  • Support Python UDFs written in terms of rows (#9343) @brandon-b-miller
  • JNI: Support nested types in ORC writer (#9334) @firestarman
  • Optionally nullify out-of-bounds indices in segmented_gather(). (#9318) @mythrocks
  • Add shallow hash function and shallow equality comparison for column_view (#9312) @karthikeyann
  • Add CudaMemoryBuffer for cudaMalloc memory using RMM cuda_memory_resource (#9311) @rongou
  • Add parameters to control row index stride and stripe size in ORC writer (#9310) @vuule
  • Add na_position param to dask-cudf sort_values (#9264) @charlesbluca
  • Add ascending parameter for dask-cudf sort_values (#9250) @charlesbluca
  • New array conversion methods (#9236) @vyasr
  • Series apply method backed by masked UDFs (#9217) @brandon-b-miller
  • Grouping by frequency and resampling (#9178) @shwina
  • Pure-python masked UDFs (#9174) @brandon-b-miller
  • Add Covariance, Pearson correlation for sort groupby (libcudf) (#9154) @karthikeyann
  • Add calendrical_month_sequence in c++ and date_range in python (#8886) @shwina

πŸ› οΈ Improvements

  • Followup to PR 9088 comments (#9659) @cwharris
  • Update cuCollections to version that supports installed libcudacxx (#9633) @robertmaynard
  • Add 11.5 dev.yml to cudf (#9617) @galipremsagar
  • Add xfail for parquet reader 11.5 issue (#9612) @galipremsagar
  • remove deprecated Rmm.initialize method (#9607) @rongou
  • Use HostColumnVectorCore for child columns in JCudfSerialization.unpackHostColumnVectors (#9596) @sperlingxx
  • Set RMM pool to a fixed size in JNI (#9583) @rongou
  • Use nvCOMP for Snappy compression/decompression (#9582) @vuule
  • Build CUDA version agnostic packages for dask-cudf (#9578) @Ethyling
  • Fixed tests warning: "TYPED_TEST_CASE is deprecated, please use TYPED_TEST_SUITE" (#9574) @ttnghia
  • Enable CMake format in CI and fix style (#9570) @vyasr
  • Add NVTX Start/End Ranges to JNI (#9563) @abellina
  • Add librdkafka and python-confluent-kafka to dev conda environments s… (#9562) @jdye64
  • Add offsets_begin/end() to strings_column_view (#9559) @davidwendt
  • remove alignment options for RMM jni (#9550) @rongou
  • Add axis parameter passthrough to DataFrame and Series take for pandas API compatibility (#9549) @dantegd
  • Remove sizeof and standardize on memory_usage (#9544) @vyasr
  • Adds cudaProfilerStart/cudaProfilerStop in JNI api (#9543) @abellina
  • Generalize comparison binary operations (#9542) @vyasr
  • Expose APIs to wrap CUDA or RMM allocations with a Java device buffer instance (#9538) @jlowe
  • Add scan sum support for duration types to libcudf (#9536) @davidwendt
  • Force inlining to improve AST performance (#9530) @vyasr
  • Generalize some more indexed frame methods (#9529) @vyasr
  • Add Java bindings for rolling window stddev aggregation (#9527) @razajafri
  • catch rmm::out_of_memory exceptions in jni (#9525) @rongou
  • Add an overload of make_empty_column with type_id parameter (#9524) @ttnghia
  • Accelerate conditional inner joins with larger right tables (#9523) @vyasr
  • Initial pass of generalizing decimal support in cudf python layer (#9517) @galipremsagar
  • Cleanup for flattening nested columns (#9509) @rwlee
  • Enable running tests using RMM arena and async memory resources (#9506) @rongou
  • Remove dependency on six. (#9495) @bdice
  • Cleanup some libcudf strings gtests (#9489) @davidwendt
  • Rename strings/array_tests.cu to strings/array_tests.cpp (#9480) @davidwendt
  • Refactor sorting APIs (#9464) @vyasr
  • Implement DataFrame.hash_values, deprecate DataFrame.hash_columns. (#9458) @bdice
  • Deprecate Series.hash_encode. (#9457) @bdice
  • Update conda recipes for Enhanced Compatibility effort (#9456) @ajschmidt8
  • Small clean up to simplify column selection code in ORC reader (#9444) @vuule
  • add missing stream to scalar.is_valid() wherever stream is available (#9436) @karthikeyann
  • Adds Deprecation Warnings to one_hot_encoding and Implement get_dummies with Cython API (#9435) @isVoid
  • Update pre-commit hook URLs. (#9433) @bdice
  • Remove pyarrow import in dask_cudf.io.parquet (#9429) @charlesbluca
  • Miscellaneous improvements for UDFs (#9422) @isVoid
  • Use pre-commit for CI (#9412) @vyasr
  • Update to UCX-Py 0.23 (#9407) @pentschev
  • Expose OutOfBoundsPolicy in JNI for Table.gather (#9406) @abellina
  • Improvements to tdigest aggregation code. (#9403) @nvdbaranec
  • Add Java API to deserialize a table to host columns (#9402) @jlowe
  • Frame copy to use class instead of type() (#9397) @madsbk
  • Change all DeprecationWarnings to FutureWarning. (#9392) @bdice
  • Update Java nvcomp JNI bindings to nvcomp 2.x API (#9384) @jbrennan333
  • Add IndexedFrame class and move SingleColumnFrame to a separate module (#9378) @vyasr
  • Support Arrow NativeFile and PythonFile for remote ORC storage (#9377) @rjzamora
  • Use Arrow PythonFile for remote CSV storage (#9376) @rjzamora
  • Add multi-threaded writing to GDS writes (#9372) @devavret
  • Miscellaneous column cleanup (#9370) @vyasr
  • Use single kernel to extract all groups in cudf::strings::extract (#9358) @davidwendt
  • Consolidate binary ops into Frame (#9357) @isVoid
  • Move rank scan implementations from scan_inclusive.cu to rank_scan.cu (#9351) @davidwendt
  • Remove usage of deprecated thrust::host_space_tag. (#9350) @bdice
  • Use Default Memory Resource for Temporaries in reduction.cpp (#9344) @isVoid
  • Fix Cython compilation warnings. (#9327) @bdice
  • Fix some unused variable warnings in libcudf (#9326) @davidwendt
  • Use optional-iterator for copy-if-else kernel (#9324) @davidwendt
  • Remove Table class (#9315) @vyasr
  • Unpin dask and distributed in CI (#9307) @galipremsagar
  • Add optional-iterator support to indexalator (#9306) @davidwendt
  • Consolidate more methods in Frame (#9305) @vyasr
  • Add Arrow-NativeFile and PythonFile support to read_parquet and read_csv in cudf (#9304) @rjzamora
  • Pin mypy in .pre-commit-config.yaml to match conda environment pinning. (#9300) @bdice
  • Use gather.hpp when gather-map exists in device memory (#9299) @davidwendt
  • Fix Automerger for Branch-21.12 from branch-21.10 (#9285) @galipremsagar
  • Refactor cuIO timestamp processing with cuda::std::chrono (#9278) @PointKernel
  • Change strings copy_if_else to use optional-iterator instead of pair-iterator (#9266) @davidwendt
  • Update cudf java bindings to 21.12.0-SNAPSHOT (#9248) @pxLi
  • Various internal MultiIndex improvements (#9243) @vyasr
  • Add detail interface for split and slice(table_view), refactors both function with host_span (#9226) @isVoid
  • Refactor MD5 implementation. (#9212) @bdice
  • Update groupby result_cache to allow sharing intermediate results based on column_view instead of requests. (#9195) @karthikeyann
  • Use nvcomp's snappy decompressor in avro reader (#9181) @devavret
  • Add isocalendar API support (#9169) @marlenezw
  • Simplify read_json by removing unnecessary reader/impl classes (#9088) @cwharris
  • Simplify read_csv by removing unnecessary reader/impl classes (#9041) @cwharris
  • Refactor hash join with cuCollections multimap (#8934) @PointKernel

- C++
Published by GPUtester about 4 years ago

https://github.com/rapidsai/cudf - v21.10.01

v21.10.01

- C++
Published by GPUtester over 4 years ago

https://github.com/rapidsai/cudf - v21.10.00

🚨 Breaking Changes

  • Remove Cython APIs for table view generation (#9199) @vyasr
  • Upgrade pandas version in cudf (#9147) @galipremsagar
  • Make AST operators nullable (#9096) @vyasr
  • Remove the option to pass data types as strings to read_csv and read_json (#9079) @vuule
  • Update JNI java CSV APIs to not use deprecated API (#9066) @revans2
  • Support additional format specifiers in from_timestamps (#9047) @davidwendt
  • Expose expression base class publicly and simplify public AST API (#9045) @vyasr
  • Add support for struct type in ORC writer (#9025) @vuule
  • Remove aliases of various api.types APIs from utils.dtypes. (#9011) @vyasr
  • Java bindings for conditional join output sizes (#9002) @jlowe
  • Move compute_column API out of ast namespace (#8957) @vyasr
  • cudf.dtype function (#8949) @shwina
  • Refactor Frame reductions (#8944) @vyasr
  • Add nested column selection to parquet reader (#8933) @devavret
  • JNI Aggregation Type Changes (#8919) @revans2
  • Add groupby_aggregation and groupby_scan_aggregation classes and force their usage. (#8906) @nvdbaranec
  • Expand CSV and JSON reader APIs to accept dtypes as a vector or map of data_type objects (#8856) @vuule
  • Change cudf docs theme to pydata theme (#8746) @galipremsagar
  • Enable compiled binary ops in libcudf, python and java (#8741) @karthikeyann
  • Make groupby transform-like op order match original data order (#8720) @isVoid

πŸ› Bug Fixes

  • fixed_point cudf::groupby for mean aggregation (#9296) @codereport
  • Fix interleave_columns when the input string lists column having empty child column (#9292) @ttnghia
  • Update nvcomp to include fixes for installation of headers (#9276) @devavret
  • Fix Java column leak in testParquetWriteMap (#9271) @jlowe
  • Fix call to thrust::reduce_by_key in argmin/argmax libcudf groupby (#9263) @davidwendt
  • Fixing empty input to getMapValue crashing (#9262) @hyperbolic2346
  • Fix duplicate names issue in MultiIndex.deserialize (#9258) @galipremsagar
  • Dataframe.sort_index optimizations (#9238) @galipremsagar
  • Temporarily disabling problematic test in parquet writer (#9230) @devavret
  • Explicitly disable groupby on unsupported key types. (#9227) @mythrocks
  • Fix gather for sliced input structs column (#9218) @ttnghia
  • Fix JNI code for left semi and anti joins (#9207) @jlowe
  • Only install thrust when using a non 'system' version (#9206) @robertmaynard
  • Remove zlib from libcudf public CMake dependencies (#9204) @robertmaynard
  • Fix out-of-bounds memory read in orc gpuEncodeOrcColumnData (#9196) @davidwendt
  • Fix gather() for STRUCT inputs with no nulls in members. (#9194) @mythrocks
  • get_cucollections properly uses rapids_cpm_find (#9189) @robertmaynard
  • rapids-export correctly reference build code block and doc strings (#9186) @robertmaynard
  • Fix logic while parsing the sum statistic for numerical orc columns (#9183) @ayushdg
  • Add handling for nulls in dask_cudf.sorting.quantile_divisions (#9171) @charlesbluca
  • Approximate overflow detection in ORC statistics (#9163) @vuule
  • Use decimal precision metadata when reading from parquet files (#9162) @shwina
  • Fix variable name in Java build script (#9161) @jlowe
  • Import rapids-cmake modules using the correct cmake variable. (#9149) @robertmaynard
  • Fix conditional joins with empty left table (#9146) @vyasr
  • Fix joining on indexes with duplicate level names (#9137) @shwina
  • Fixes missing child column name in dtype while reading ORC file. (#9134) @rgsl888prabhu
  • Apply type metadata after column is slice-copied (#9131) @isVoid
  • Fix a bug: inner_join_size return zero if build table is empty (#9128) @PointKernel
  • Fix multi hive-partition parquet reading in dask-cudf (#9122) @rjzamora
  • Support null literals in expressions (#9117) @vyasr
  • Fix cudf::hash_join output size for struct joins (#9107) @jlowe
  • Import fix (#9104) @shwina
  • Fix cudf::strings::is_fixed_point checking of overflow for decimal32 (#9093) @davidwendt
  • Fix branch_stack calculation in `row_bit_count()` (#9076) @mythrocks
  • Fetch rapids-cmake to work around cuCollection cmake issue (#9075) @jlowe
  • Fix compilation errors in groupby benchmarks. (#9072) @nvdbaranec
  • Preserve float16 upscaling (#9069) @galipremsagar
  • Fix memcheck read error in libcudf contiguous_split (#9067) @davidwendt
  • Add support for reading ORC file with no row group index (#9060) @rgsl888prabhu
  • Various multiindex related fixes (#9036) @shwina
  • Avoid rebuilding cython in build.sh (#9034) @brandon-b-miller
  • Add support for percentile dispatch in dask_cudf (#9031) @galipremsagar
  • cudf resolve nvcc 11.0 compiler crashes during codegen (#9028) @robertmaynard
  • Fetch correct grouping keys agg of dask groupby (#9022) @galipremsagar
  • Allow where() to work with a Series and other=cudf.NA (#9019) @sarahyurick
  • Use correct index when returning Series from GroupBy.apply() (#9016) @charlesbluca
  • Fix Dataframe indexer setitem when array is passed (#9006) @galipremsagar
  • Fix ORC reading of files with struct columns that have null values (#9005) @vuule
  • Ensure JNI native libraries load when CompiledExpression loads (#8997) @jlowe
  • Fix memory read error in get_dremel_data in page_enc.cu (#8995) @davidwendt
  • Fix memory write error in get_list_child_to_list_row_mapping utility (#8994) @davidwendt
  • Fix debug compile error for csv_test.cpp (#8981) @davidwendt
  • Fix memory read/write error in concatenate_lists_ignore_null (#8978) @davidwendt
  • Fix concatenation of cudf.RangeIndex (#8970) @galipremsagar
  • Java conditional joins should not require matching column counts (#8955) @jlowe
  • Fix concatenate empty structs (#8947) @sperlingxx
  • Fix cuda-memcheck errors for some libcudf functions (#8941) @davidwendt
  • Apply series name to result of SeriesGroupby.apply() (#8939) @charlesbluca
  • cdef packed_columns as cppclass instead of struct (#8936) @charlesbluca
  • Inserting a cudf.NA into a DataFrame (#8923) @sarahyurick
  • Support casting with Pandas dtype aliases (#8920) @sarahyurick
  • Allow sort_values to accept same kind values as Pandas (#8912) @sarahyurick
  • Enable casting to pandas nullable dtypes (#8889) @brandon-b-miller
  • Fix libcudf memory errors (#8884) @karthikeyann
  • Throw KeyError when accessing field from struct with nonexistent key (#8880) @NV-jpt
  • replace auto with auto& ref for cast<&> (#8866) @karthikeyann
  • Add missing include<optional> in binops (#8864) @karthikeyann
  • Fix select_dtypes to work when non-class dtypes present in dataframe (#8849) @sarahyurick
  • Re-enable JSON tests (#8843) @vuule
  • Support header with embedded delimiter in csv writer (#8798) @davidwendt

📖 Documentation

  • Add IO docs page in cudf documentation (#9145) @galipremsagar
  • use correct namespace in cuio code examples (#9037) @cwharris
  • Restructuring Contributing doc (#9026) @iskode
  • Update stable version in readme (#9008) @galipremsagar
  • Add spans and more include guidelines to libcudf developer guide (#8931) @harrism
  • Update Java build instructions to mention Arrow S3 and Docker (#8867) @jlowe
  • List GDS-enabled formats in the docs (#8805) @vuule
  • Change cudf docs theme to pydata theme (#8746) @galipremsagar

🚀 New Features

  • Revert "Add shallow hash function and shallow equality comparison for column_view (#9185)" (#9283) @karthikeyann
  • Align DataFrame.apply signature with pandas (#9275) @brandon-b-miller
  • Add struct type support for drop_list_duplicates (#9202) @ttnghia
  • support CUDA async memory resource in JNI (#9201) @rongou
  • Add shallow hash function and shallow equality comparison for column_view (#9185) @karthikeyann
  • Superimpose null masks for STRUCT columns. (#9144) @mythrocks
  • Implemented bindings for ceil timestamp operation (#9141) @shaneding
  • Adding MAP type support for ORC Reader (#9132) @rgsl888prabhu
  • Implement interleave_columns for lists with arbitrary nested type (#9130) @ttnghia
  • Add python bindings to fixed-size window and groupby rolling.var, rolling.std (#9097) @isVoid
  • Make AST operators nullable (#9096) @vyasr
  • Java bindings for approx_percentile (#9094) @andygrove
  • Add dseries.struct.explode (#9086) @isVoid
  • Add support for BaseIndexer in Rolling APIs (#9085) @galipremsagar
  • Remove the option to pass data types as strings to read_csv and read_json (#9079) @vuule
  • Add handling for nested dicts in dask-cudf groupby (#9054) @charlesbluca
  • Added Series.dt.is_quarter_start and Series.dt.is_quarter_end (#9046) @TravisHester
  • Support nested types for nth_element reduction (#9043) @sperlingxx
  • Update sort groupby to use non-atomic operation (#9035) @karthikeyann
  • Add support for struct type in ORC writer (#9025) @vuule
  • Implement interleave_columns for structs columns (#9012) @ttnghia
  • Add groupby first and last aggregations (#9004) @shwina
  • Add DecimalBaseColumn and move as_decimal_column (#9001) @isVoid
  • Python/Cython bindings for multibyte_split (#8998) @jdye64
  • Support scalar months in add_calendrical_months, extends API to INT32 support (#8991) @isVoid
  • Added Series.dt.is_month_end (#8989) @TravisHester
  • Support for using tdigests to compute approximate percentiles. (#8983) @nvdbaranec
  • Support "unflatten" of columns flattened via flatten_nested_columns(): (#8956) @mythrocks
  • Implement timestamp ceil (#8942) @shaneding
  • Add nested column selection to parquet reader (#8933) @devavret
  • Expose conditional join size calculation (#8928) @vyasr
  • Support Nulls in Timeseries Generator (#8925) @isVoid
  • Avoid index equality check in _CPackedColumns.from_py_table() (#8917) @charlesbluca
  • Add dot product binary op (#8909) @charlesbluca
  • Expose days_in_month function in libcudf and add python bindings (#8892) @isVoid
  • Series string repeat (#8882) @sarahyurick
  • Python binding for quarters (#8862) @shaneding
  • Expand CSV and JSON reader APIs to accept dtypes as a vector or map of data_type objects (#8856) @vuule
  • Add Java bindings for AST transform (#8846) @jlowe
  • Series datetime is_month_start (#8844) @sarahyurick
  • Support bracket syntax for cudf::strings::replace_with_backrefs group index values (#8841) @davidwendt
  • Support VARIANCE and STD aggregation in rolling op (#8809) @isVoid
  • Add quarters to libcudf datetime (#8779) @shaneding
  • Linear Interpolation of nans via cupy (#8767) @brandon-b-miller
  • Enable compiled binary ops in libcudf, python and java (#8741) @karthikeyann
  • Make groupby transform-like op order match original data order (#8720) @isVoid
  • multibyte_split (#8702) @cwharris
  • Implement JNI for strings:repeat_strings that repeats each string separately by different numbers of times (#8572) @ttnghia

πŸ› οΈ Improvements

  • Pin max dask and distributed versions to 2021.09.1 (#9286) @galipremsagar
  • Optimized fsspec data transfer for remote file-systems (#9265) @rjzamora
  • Skip dask-cudf tests on arm64 (#9252) @Ethyling
  • Use nvcomp's snappy compressor in ORC writer (#9242) @devavret
  • Only run imports tests on x86_64 (#9241) @Ethyling
  • Remove unnecessary call to device_uvector::release() (#9237) @harrism
  • Use nvcomp's snappy decompression in ORC reader (#9235) @devavret
  • Add grouped_rolling test with STRUCT groupby keys. (#9228) @mythrocks
  • Optimize cudf.concat for axis=0 (#9222) @galipremsagar
  • Fix some libcudf calls not passing the stream parameter (#9220) @davidwendt
  • Add min and max bounds for random dataframe generator numeric types (#9211) @galipremsagar
  • Improve performance of expression evaluation (#9210) @vyasr
  • Misc optimizations in cudf (#9203) @galipremsagar
  • Remove Cython APIs for table view generation (#9199) @vyasr
  • Add JNI support for drop_list_duplicates (#9198) @revans2
  • Update pandas versions in conda recipes and requirements.txt files (#9197) @galipremsagar
  • Minor C++17 cleanup of groupby.cu: structured bindings, more concise lambda, etc (#9193) @codereport
  • Explicit about bitwidth difference between cudf boolean and arrow boolean (#9192) @isVoid
  • Remove _source_index from MultiIndex (#9191) @vyasr
  • Fix typo in the name of cudf-testing-targets.cmake (#9190) @trxcllnt
  • Add support for single-digits in cudf::to_timestamps (#9173) @davidwendt
  • Fix cufilejni build include path (#9168) @pxLi
  • dask_cudf dispatch registering cleanup (#9160) @galipremsagar
  • Remove unneeded stream/mr from a cudf::make_strings_column (#9148) @davidwendt
  • Upgrade pandas version in cudf (#9147) @galipremsagar
  • make data chunk reader return unique_ptr (#9129) @cwharris
  • Add backend for percentile_lookup dispatch (#9118) @galipremsagar
  • Refactor implementation of column setitem (#9110) @vyasr
  • Fix compile warnings found using nvcc 11.4 (#9101) @davidwendt
  • Update to UCX-Py 0.22 (#9099) @pentschev
  • Simplify read_avro by removing unnecessary writer/impl classes (#9090) @cwharris
  • Allowing %f in format to return nanoseconds (#9081) @marlenezw
  • Java bindings for cudf::hash_join (#9080) @jlowe
  • Remove stale code in ColumnBase._fill (#9078) @isVoid
  • Add support for get_group in GroupBy (#9070) @galipremsagar
  • Remove remaining "support" methods from DataFrame (#9068) @vyasr
  • Update JNI java CSV APIs to not use deprecated API (#9066) @revans2
  • Added method to remove null_masks if the column has no nulls (#9061) @razajafri
  • Consolidate Several Series and Dataframe Methods (#9059) @isVoid
  • Remove usage of string based set_dtypes for csv & json readers (#9049) @galipremsagar
  • Remove some debug print statements from gtests (#9048) @davidwendt
  • Support additional format specifiers in from_timestamps (#9047) @davidwendt
  • Expose expression base class publicly and simplify public AST API (#9045) @vyasr
  • move filepath and mmap logic out of json/csv up to functions.cpp (#9040) @cwharris
  • Refactor Index hierarchy (#9039) @vyasr
  • cudf now leverages rapids-cmake to reduce CMake boilerplate (#9030) @robertmaynard
  • Add support for STRUCT input to groupby (#9024) @mythrocks
  • Refactor Frame scans (#9021) @vyasr
  • Remove duplicate set_categories code (#9018) @isVoid
  • Map support for ParquetWriter (#9013) @razajafri
  • Remove aliases of various api.types APIs from utils.dtypes. (#9011) @vyasr
  • Java bindings for conditional join output sizes (#9002) @jlowe
  • Remove _copy_construct factory (#8999) @vyasr
  • ENH Allow arbitrary CMake config options in build.sh (#8996) @dillon-cullinan
  • A small optimization for JNI copy column view to column vector (#8985) @revans2
  • Fix nvcc warnings in ORC writer (#8975) @devavret
  • Support nested structs in rank and dense rank (#8962) @rwlee
  • Move compute_column API out of ast namespace (#8957) @vyasr
  • Series datetime is_year_end and is_year_start (#8954) @marlenezw
  • Make Java AstNode public (#8953) @jlowe
  • Replace allocate with device_uvector for subword_tokenize internal tables (#8952) @davidwendt
  • cudf.dtype function (#8949) @shwina
  • Refactor Frame reductions (#8944) @vyasr
  • Add deprecation warning for Series.set_mask API (#8943) @galipremsagar
  • Move AST evaluator into a separate header (#8930) @vyasr
  • JNI Aggregation Type Changes (#8919) @revans2
  • Move template parameter to function parameter in cudf::detail::left_semi_anti_join (#8914) @davidwendt
  • Upgrade arrow & pyarrow to 5.0.0 (#8908) @galipremsagar
  • Add groupby_aggregation and groupby_scan_aggregation classes and force their usage. (#8906) @nvdbaranec
  • Move structs_column_tests.cu to .cpp. (#8902) @mythrocks
  • Add stream and memory-resource parameters to struct-scalar copy ctor (#8901) @davidwendt
  • Combine linearizer and ast_plan (#8900) @vyasr
  • Add Java bindings for conditional join gather maps (#8888) @jlowe
  • Remove max version pin for dask & distributed on development branch (#8881) @galipremsagar
  • fix cufilejni build w/ c++17 (#8877) @pxLi
  • Add struct accessor to dask-cudf (#8874) @NV-jpt
  • Migrate dask-cudf CudfEngine to leverage ArrowDatasetEngine (#8871) @rjzamora
  • Add JNI for extract_quarter, add_calendrical_months, and is_leap_year (#8863) @revans2
  • Change cudf::scalar copy and move constructors to protected (#8857) @davidwendt
  • Replace is_same<>::value with is_same_v<> (#8852) @codereport
  • Add min pytorch version to importorskip in pytest (#8851) @galipremsagar
  • Java bindings for regex replace (#8847) @jlowe
  • Remove make strings children with null mask (#8830) @davidwendt
  • Refactor conditional joins (#8815) @vyasr
  • Small cleanup (unused headers / commented code removals) (#8799) @codereport
  • ENH Replace gpuci_conda_retry with gpuci_mamba_retry (#8770) @dillon-cullinan
  • Update cudf java bindings to 21.10.0-SNAPSHOT (#8765) @pxLi
  • Refactor and improve join benchmarks with nvbench (#8734) @PointKernel
  • Refactor Python factories and remove usage of Table for libcudf output handling (#8687) @vyasr
  • Optimize URL Decoding (#8622) @gaohao95
  • Parquet writer dictionary encoding refactor (#8476) @devavret
  • Use nvcomp's snappy decompression in parquet reader (#8252) @devavret
  • Use nvcomp's snappy compressor in parquet writer (#8229) @devavret

- C++
Published by GPUtester over 4 years ago

https://github.com/rapidsai/cudf - v21.08.03

v21.08.03

- C++
Published by GPUtester over 4 years ago

https://github.com/rapidsai/cudf - v21.08.02

v21.08.02

- C++
Published by GPUtester over 4 years ago

https://github.com/rapidsai/cudf - v21.08.01

v21.08.01

- C++
Published by GPUtester over 4 years ago

https://github.com/rapidsai/cudf - v21.08.00

🚨 Breaking Changes

  • Fix a crash in pack() when being handed tables with no columns. (#8697) @nvdbaranec
  • Remove unused cudf::strings::create_offsets (#8663) @davidwendt
  • Add delimiter parameter to cudf::strings::capitalize() (#8620) @davidwendt
  • Change default datetime index resolution to ns to match pandas (#8611) @vyasr
  • Add sequence_type parameter to cudf::strings::title function (#8602) @davidwendt
  • Add strings::repeat_strings API that can repeat each string a different number of times (#8561) @ttnghia
  • String-to-boolean conversion is different from Pandas (#8549) @skirui-source
  • Add accurate hash join size functions (#8453) @PointKernel
  • Expose a Decimal32Dtype in cuDF Python (#8438) @skirui-source
  • Update dask make_meta changes to be compatible with dask upstream (#8426) @galipremsagar
  • Adapt cudf::scalar classes to changes in rmm::device_scalar (#8411) @harrism
  • Remove special Index class from the general index class hierarchy (#8309) @vyasr
  • Add first-class dtype utilities (#8308) @vyasr
  • ORC - Support reading multiple orc files/buffers in a single operation (#8142) @jdye64
  • Upgrade arrow to 4.0.1 (#7495) @galipremsagar

🐛 Bug Fixes

  • Fix contains check in string column (#8834) @galipremsagar
  • Remove unused variable from row_bit_count_test. (#8829) @mythrocks
  • Fixes issue with null struct columns in ORC reader (#8819) @rgsl888prabhu
  • Set CMake vars for python/parquet support in libarrow builds (#8808) @vyasr
  • Handle empty child columns in row_bit_count() (#8791) @mythrocks
  • Revert "Remove cudf unneeded build time requirement of the cuda driver" (#8784) @robertmaynard
  • Fix isort error in utils.pyx (#8771) @charlesbluca
  • Handle sliced struct/list columns properly in concatenate() bounds checking. (#8760) @nvdbaranec
  • Fix issues with _CPackedColumns.serialize() handling of host and device data (#8759) @charlesbluca
  • Fix issues with MultiIndex in dropna, stack & reset_index (#8753) @galipremsagar
  • Write pandas extension types to parquet file metadata (#8749) @devavret
  • Fix where to handle DataFrame & Series input combination (#8747) @galipremsagar
  • Fix replace to handle null values correctly (#8744) @galipremsagar
  • Handle sliced structs properly in pack/contiguous_split. (#8739) @nvdbaranec
  • Fix issue in slice() where columns with a positive offset were computing null counts incorrectly. (#8738) @nvdbaranec
  • Fix cudf.Series constructor to handle list of sequences (#8735) @galipremsagar
  • Fix min/max sorted groupby aggregation on string column with nulls (argmin, argmax sentinel value missing on nulls) (#8731) @karthikeyann
  • Fix orc reader assert on create data_type in debug (#8706) @davidwendt
  • Fix min/max inclusive cudf::scan for strings column (#8705) @davidwendt
  • JNI: Fix driver version assertion logic in testGetCudaRuntimeInfo (#8701) @sperlingxx
  • Adding fix for skip_rows and crash in orc reader (#8700) @rgsl888prabhu
  • Bug fix: replace_nulls_policy functor not returning correct indices for gathermap (#8699) @isVoid
  • Fix a crash in pack() when being handed tables with no columns. (#8697) @nvdbaranec
  • Add post-processing steps to dask_cudf.groupby.CudfSeriesGroupby.aggregate (#8694) @charlesbluca
  • JNI build no longer looks for Arrow in conda environment (#8686) @jlowe
  • Handle arbitrarily different data in null list column rows when checking for equivalency. (#8666) @nvdbaranec
  • Add ConfigureNVBench to avoid concurrent main() entry points (#8662) @PointKernel
  • Pin *arrow to use *cuda in run (#8651) @jakirkham
  • Add proper support for tolerances in testing methods. (#8649) @vyasr
  • Support multi-char case conversion in capitalize function (#8647) @davidwendt
  • Fix repeated mangled names in read_csv with duplicate column names (#8645) @karthikeyann
  • Temporarily disable libcudf example build tests (#8642) @isVoid
  • Use conda-sourced cudf artifacts for libcudf example in CI (#8638) @isVoid
  • Ensure dev environment uses Arrow GPU packages (#8637) @charlesbluca
  • Fix bug that columns only initialized once when specified columns and index in dataframe ctor (#8628) @isVoid
  • Propagate **kwargs through to as_*_column methods (#8618) @shwina
  • Fix orc_reader_benchmark.cpp compile error (#8609) @davidwendt
  • Fix missed renumbering of Aggregation values (#8600) @revans2
  • Update cmake to 3.20.5 in the Java Docker image (#8593) @NvTimLiu
  • Fix bug in replace_with_backrefs when group has greedy quantifier (#8575) @davidwendt
  • Apply metadata to keys before returning in Frame._encode (#8560) @charlesbluca
  • Fix for strings containing special JSON characters in get_json_object(). (#8556) @nvdbaranec
  • Fix debug compile error in gather_struct_tests.cpp (#8554) @davidwendt
  • String-to-boolean conversion is different from Pandas (#8549) @skirui-source
  • Fix __repr__ output with display.max_rows is None (#8547) @galipremsagar
  • Fix size passed to column constructors in _with_type_metadata (#8539) @shwina
  • Properly retrieve last column when -1 is specified for column index (#8529) @isVoid
  • Fix importing apply from dask (#8517) @galipremsagar
  • Fix offset of the string dictionary length stream (#8515) @vuule
  • Fix double counting of selected columns in CSV reader (#8508) @ochan1
  • Incorrect map size in scatter_to_gather corrupts struct columns (#8507) @gerashegalov
  • replace_nulls properly propagates memory resource to gather calls (#8500) @robertmaynard
  • Disallow groupby aggs for StructColumns (#8499) @charlesbluca
  • Fixes out-of-bounds access for small files in unzip (#8498) @elstehle
  • Adding support for writing empty dataframe (#8490) @shaneding
  • Fix exclusive scan when including nulls and improve testing (#8478) @harrism
  • Add workaround for crash in libcudf debug build using output_indexalator in thrust::lower_bound (#8432) @davidwendt
  • Install only the same Thrust files that Thrust itself installs (#8420) @robertmaynard
  • Add nightly version for ucx-py in ci script (#8419) @galipremsagar
  • Fix null_equality config of rolling_collect_set (#8415) @sperlingxx
  • CollectSetAggregation: implement RollingAggregation interface (#8406) @sperlingxx
  • Handle pre-sliced nested columns in contiguous_split. (#8391) @nvdbaranec
  • Fix bitmask_tests.cpp host accessing device memory (#8370) @davidwendt
  • Fix concurrent_unordered_map to prevent accessing padding bits in pair_type (#8348) @davidwendt
  • BUG FIX: Raise appropriate strings error when concatenating strings column (#8290) @skirui-source
  • Make gpuCI and pre-commit style configurations consistent (#8215) @charlesbluca
  • Add collect list to dask-cudf groupby aggregations (#8045) @charlesbluca

📖 Documentation

  • Update Python UDFs notebook (#8810) @brandon-b-miller
  • Fix dask.dataframe API docs links after reorg (#8772) @jsignell
  • Fix instructions for running cuDF/dask-cuDF tests in CONTRIBUTING.md (#8724) @shwina
  • Translate Markdown documentation to rST and remove recommonmark (#8698) @vyasr
  • Fixed spelling mistakes in libcudf documentation (#8664) @karthikeyann
  • Custom Sphinx Extension: PandasCompat (#8643) @isVoid
  • Fix README.md (#8535) @ajschmidt8
  • Change namespace contains_nulls to struct (#8523) @davidwendt
  • Add info about NVTX ranges to dev guide (#8461) @jrhemstad
  • Fixed documentation bug in groupby agg method (#8325) @ahmet-uyar

🚀 New Features

  • Fix concatenating structs (#8811) @shaneding
  • Implement JNI for groupby aggregations M2 and MERGE_M2 (#8763) @ttnghia
  • Bump isort to 5.6.4 and remove isort overrides made for 5.0.7 (#8755) @charlesbluca
  • Implement __setitem__ for StructColumn (#8737) @shaneding
  • Add is_leap_year to DateTimeProperties and DatetimeIndex (#8736) @isVoid
  • Add struct.explode() method (#8729) @shwina
  • Add DataFrame.to_struct() method to convert a DataFrame to a struct Series (#8728) @shwina
  • Add support for list type in ORC writer (#8723) @vuule
  • Fix slicing from struct columns and accessing struct columns (#8719) @shaneding
  • Add datetime::is_leap_year (#8711) @isVoid
  • Accessing struct columns from dask_cudf (#8675) @shaneding
  • Added pct_change to Series (#8650) @TravisHester
  • Add strings support to cudf::shift function (#8648) @davidwendt
  • Support Scatter struct_scalar (#8630) @isVoid
  • Struct scalar from host dictionary (#8629) @shaneding
  • Add dayofyear and day_of_year to Series, DatetimeColumn, and DatetimeIndex (#8626) @beckernick
  • JNI support for capitalize (#8624) @firestarman
  • Add delimiter parameter to cudf::strings::capitalize() (#8620) @davidwendt
  • Add NVBench in CMake (#8619) @PointKernel
  • Change default datetime index resolution to ns to match pandas (#8611) @vyasr
  • ListColumn __setitem__ (#8606) @brandon-b-miller
  • Implement groupby aggregations M2 and MERGE_M2 (#8605) @ttnghia
  • Add sequence_type parameter to cudf::strings::title function (#8602) @davidwendt
  • Adding support for list and struct type in ORC Reader (#8599) @rgsl888prabhu
  • Benchmark for strings::repeat_strings APIs (#8589) @ttnghia
  • Nested scalar support for copy if else (#8588) @gerashegalov
  • User specified decimal columns to float64 (#8587) @jdye64
  • Add get_element for struct column (#8578) @isVoid
  • Python changes for adding __getitem__ for struct (#8577) @shaneding
  • Add strings::repeat_strings API that can repeat each string a different number of times (#8561) @ttnghia
  • Refactor tests/iterator_utilities.hpp functions (#8540) @ttnghia
  • Support MERGE_LISTS and MERGE_SETS in Java package (#8516) @sperlingxx
  • Decimal support csv reader (#8511) @elstehle
  • Add column type tests (#8505) @isVoid
  • Warn when downscaling decimal columns (#8492) @ChrisJar
  • Add JNI for strings::repeat_strings (#8491) @ttnghia
  • Add Index.get_loc for Numerical, String Index support (#8489) @isVoid
  • Expose half_up rounding in cuDF (#8477) @shwina
  • Java APIs to fetch CUDA runtime info (#8465) @sperlingxx
  • Add str.edit_distance_matrix (#8463) @isVoid
  • Support constructing cudf.Scalar objects from host side lists (#8459) @brandon-b-miller
  • Add accurate hash join size functions (#8453) @PointKernel
  • Add cudf::strings::integer_to_hex convert API (#8450) @davidwendt
  • Create objects from iterables that contain cudf.NA (#8442) @brandon-b-miller
  • JNI bindings for sort_lists (#8439) @sperlingxx
  • Expose a Decimal32Dtype in cuDF Python (#8438) @skirui-source
  • Replace all_null() and all_valid() by iterator_all_nulls() and iterator_no_null() in tests (#8437) @ttnghia
  • Implement groupby MERGE_LISTS and MERGE_SETS aggregates (#8436) @ttnghia
  • Add public libcudf match_dictionaries API (#8429) @davidwendt
  • Add move constructors for string_scalar and struct_scalar (#8428) @ttnghia
  • Implement strings::repeat_strings (#8423) @ttnghia
  • STRUCT column support for cudf::merge. (#8422) @nvdbaranec
  • Implement reverse in libcudf (#8410) @shaneding
  • Support multiple input files/buffers for read_json (#8403) @jdye64
  • Improve test coverage for struct search (#8396) @ttnghia
  • Add groupby.fillna (#8362) @isVoid
  • Enable AST-based joining (#8214) @vyasr
  • Generalized null support in user defined functions (#8213) @brandon-b-miller
  • Add compiled binary operation (#8192) @karthikeyann
  • Implement .describe() for DataFrameGroupBy (#8179) @skirui-source
  • ORC - Support reading multiple orc files/buffers in a single operation (#8142) @jdye64
  • Add Python bindings for lists::concatenate_list_elements and expose them as .list.concat() (#8006) @shwina
  • Use Arrow URI FileSystem backed instance to retrieve remote files (#7709) @jdye64
  • Example to build custom application and link to libcudf (#7671) @isVoid
  • Upgrade arrow to 4.0.1 (#7495) @galipremsagar

🛠️ Improvements

  • Provide a better error message when CUDA::cuda_driver not found (#8794) @robertmaynard
  • Remove anonymous namespace from null_mask.cuh (#8786) @nvdbaranec
  • Allow cudf to be built without libcuda.so existing (#8751) @robertmaynard
  • Pin mimesis to <4.1 (#8745) @galipremsagar
  • Update conda environment name for CI (#8692) @ajschmidt8
  • Remove flatbuffers dependency (#8671) @Ethyling
  • Add options to build Arrow with Python and Parquet support (#8670) @trxcllnt
  • Remove unused cudf::strings::create_offsets (#8663) @davidwendt
  • Update GDS lib version to 1.0.0 (#8654) @pxLi
  • Support for groupby/scan rank and dense_rank aggregations (#8652) @rwlee
  • Fix usage of deprecated arrow ipc API (#8632) @revans2
  • Use absolute imports in cudf (#8631) @galipremsagar
  • ENH Add Java CI build script (#8627) @dillon-cullinan
  • Add DeprecationWarning to ser.str.subword_tokenize (#8603) @VibhuJawa
  • Rewrite binary operations for improved performance and additional type support (#8598) @vyasr
  • Fix mypy errors surfacing because of numpy-1.21.0 (#8595) @galipremsagar
  • Remove unneeded includes from cudf::string_view headers (#8594) @davidwendt
  • Use cmake 3.20.1 as it is now required by rmm (#8586) @robertmaynard
  • Remove device debug symbols from cmake CUDF_CUDA_FLAGS (#8584) @davidwendt
  • Dask-CuDF: use default Dask Dataframe optimizer (#8581) @madsbk
  • Remove checking if an unsigned value is less than zero (#8579) @robertmaynard
  • Remove strings_count parameter from cudf::strings::detail::create_chars_child_column (#8576) @davidwendt
  • Make cudf.api.types imports consistent (#8571) @galipremsagar
  • Modernize libcudf basic example CMakeFile; updates CI build tests (#8568) @isVoid
  • Rename concatenate_tests.cu to .cpp (#8555) @davidwendt
  • enable window lead/lag test on struct (#8548) @wbo4958
  • Add Java methods to split and write column views (#8546) @razajafri
  • Small cleanup (#8534) @codereport
  • Unpin dask version in CI (#8533) @galipremsagar
  • Added optional flag for building Arrow with S3 filesystem support (#8531) @jdye64
  • Minor clean up of various internal column and frame utilities (#8528) @vyasr
  • Rename some copying_test source files .cu to .cpp (#8527) @davidwendt
  • Correct the last warnings and issues when using newer cuda versions (#8525) @robertmaynard
  • Correct unused parameter warnings in transform and unary ops (#8521) @robertmaynard
  • Correct unused parameter warnings in string algorithms (#8509) @robertmaynard
  • Add in JNI APIs for scan, replace_nulls, groupby.scan, and groupby.replace_nulls (#8503) @revans2
  • Fix 21.08 forward-merge conflicts (#8502) @ajschmidt8
  • Fix Cython formatting command in Contributing.md. (#8496) @marlenezw
  • Bug/correct unused parameters in reshape and text (#8495) @robertmaynard
  • Correct unused parameter warnings in partitioning and stream compact (#8494) @robertmaynard
  • Correct unused parameter warnings in labelling and list algorithms (#8493) @robertmaynard
  • Refactor index construction (#8485) @vyasr
  • Correct unused parameter warnings in replace algorithms (#8483) @robertmaynard
  • Correct unused parameter warnings in reduction algorithms (#8481) @robertmaynard
  • Correct unused parameter warnings in io algorithms (#8480) @robertmaynard
  • Correct unused parameter warnings in interop algorithms (#8479) @robertmaynard
  • Correct unused parameter warnings in filling algorithms (#8468) @robertmaynard
  • Correct unused parameter warnings in groupby (#8467) @robertmaynard
  • use libcu++ time_point as timestamp (#8466) @karthikeyann
  • Modify reprog_device::extract to return groups in a single pass (#8460) @davidwendt
  • Update minimum Dask requirement to 2021.6.0 (#8458) @pentschev
  • Fix failures when performing binary operations on DataFrames with empty columns (#8452) @ChrisJar
  • Fix conflicts in 8447 (#8448) @ajschmidt8
  • Add serialization methods for List and StructDtype (#8441) @charlesbluca
  • Replace make_empty_strings_column with make_empty_column (#8435) @davidwendt
  • JNI bindings for get_element (#8433) @revans2
  • Update dask make_meta changes to be compatible with dask upstream (#8426) @galipremsagar
  • Unpin dask version on CI (#8425) @galipremsagar
  • Add benchmark for strings/fixed_point convert APIs (#8417) @davidwendt
  • Adapt cudf::scalar classes to changes in rmm::device_scalar (#8411) @harrism
  • Add benchmark for strings/integers convert APIs (#8402) @davidwendt
  • Enable multi-file partitioning in dask_cudf.read_parquet (#8393) @rjzamora
  • Correct unused parameter warnings in rolling algorithms (#8390) @robertmaynard
  • Correct unused parameters in column round and search (#8389) @robertmaynard
  • Add functionality to apply Dtype metadata to ColumnBase (#8373) @charlesbluca
  • Refactor setting stack size in regex code (#8358) @davidwendt
  • Update Java bindings to 21.08-SNAPSHOT (#8344) @pxLi
  • Replace remaining uses of device_vector (#8343) @harrism
  • Statically link libnvcomp into libcudfjni (#8334) @jlowe
  • Resolve auto merge conflicts for Branch 21.08 from branch 21.06 (#8329) @galipremsagar
  • Minor code refactor for sorted_order (#8326) @wbo4958
  • Remove special Index class from the general index class hierarchy (#8309) @vyasr
  • Add first-class dtype utilities (#8308) @vyasr
  • Add option to link Java bindings with Arrow dynamically (#8307) @jlowe
  • Refactor ColumnMethods and its subclasses to remove column argument and require parent argument (#8306) @shwina
  • Refactor scatter for list columns (#8255) @isVoid
  • Expose pack/unpack API to Python (#8153) @charlesbluca
  • Adding cudf.cut method (#8002) @marlenezw
  • Optimize string gather performance for large strings (#7980) @gaohao95
  • Add peak memory usage tracking to cuIO benchmarks (#7770) @devavret
  • Updating Clang Version to 11.0.0 (#6695) @codereport

- C++
Published by GPUtester over 4 years ago

https://github.com/rapidsai/cudf - v21.06.01

- C++
Published by GPUtester over 4 years ago

https://github.com/rapidsai/cudf - v21.06.00

🚨 Breaking Changes

  • Add support for make_meta_obj dispatch in dask-cudf (#8342) @galipremsagar
  • Add separator-on-null parameter to strings concatenate APIs (#8282) @davidwendt
  • Introduce a common parent class for NumericalColumn and DecimalColumn (#8278) @vyasr
  • Update ORC statistics API to use C++17 standard library (#8241) @vuule
  • Preserve column hierarchy when getting NULL row from LIST column (#8206) @isVoid
  • Groupby.shift c++ API refactor and python binding (#8131) @isVoid

🐛 Bug Fixes

  • Fix struct flattening to add a validity column only when the input column has null element (#8374) @ttnghia
  • Compilation fix: Remove redefinition for std::is_same_v() (#8369) @mythrocks
  • Add backward compatibility for dask-cudf to work with other versions of dask (#8368) @galipremsagar
  • Handle empty results with nested types in copy_if_else (#8359) @nvdbaranec
  • Handle nested column types properly for empty parquet files. (#8350) @nvdbaranec
  • Raise error when unsupported arguments are passed to dask_cudf.DataFrame.sort_values (#8349) @galipremsagar
  • Raise NotImplementedError for axis=1 in rank (#8347) @galipremsagar
  • Add support for make_meta_obj dispatch in dask-cudf (#8342) @galipremsagar
  • Update Java string concatenate test for single column (#8330) @tgravescs
  • Use empty_like in scatter (#8314) @revans2
  • Fix concatenate_lists_ignore_null on rows of all_nulls (#8312) @sperlingxx
  • Add separator-on-null parameter to strings concatenate APIs (#8282) @davidwendt
  • COLLECT_LIST support returning empty output columns. (#8279) @mythrocks
  • Update io util to convert path like object to string (#8275) @ayushdg
  • Fix result column types for empty inputs to rolling window (#8274) @mythrocks
  • Actually test equality in assert_groupby_results_equal (#8272) @shwina
  • CMake always explicitly specify a source files extension (#8270) @robertmaynard
  • Fix struct binary search and struct flattening (#8268) @ttnghia
  • Revert "patch thrust to fix intmax num elements limitation in scan_by_key" (#8263) @cwharris
  • upgrade dlpack to 0.5 (#8262) @cwharris
  • Fixes CSV-reader type inference for thousands separator and decimal point (#8261) @elstehle
  • Fix incorrect assertion in Java concat (#8258) @sperlingxx
  • Copy nested types upon construction (#8244) @isVoid
  • Preserve column hierarchy when getting NULL row from LIST column (#8206) @isVoid
  • Clip decimal binary op precision at max precision (#8194) @ChrisJar

📖 Documentation

  • Add docstring for dask_cudf.read_csv (#8355) @galipremsagar
  • Fix cudf release version in readme (#8331) @galipremsagar
  • Fix structs column description in dev docs (#8318) @isVoid
  • Update readme with correct CUDA versions (#8315) @raydouglass
  • Add description of the cuIO GDS integration (#8293) @vuule
  • Remove unused parameter from copy_partition kernel documentation (#8283) @robertmaynard

🚀 New Features

  • Add support merging b/w categorical data (#8332) @galipremsagar
  • Java: Support struct scalar (#8327) @sperlingxx
  • added _is_homogeneous property (#8299) @shaneding
  • Added decimal writing for CSV writer (#8296) @kaatish
  • Java: Support creating a scalar from utf8 string (#8294) @firestarman
  • Add Java API for Concatenate strings with separator (#8289) @tgravescs
  • strings::join_list_elements options for empty list inputs (#8285) @ttnghia
  • Return python lists for getitem calls to list type series (#8265) @brandon-b-miller
  • add unit tests for lead/lag on list for row window (#8259) @wbo4958
  • Create a String column from UTF8 String byte arrays (#8257) @firestarman
  • Support scattering list_scalar (#8256) @isVoid
  • Implement lists::concatenate_list_elements (#8231) @ttnghia
  • Support for struct scalars. (#8220) @nvdbaranec
  • Add support for decimal types in ORC writer (#8198) @vuule
  • Support create lists column from a list_scalar (#8185) @isVoid
  • Groupby.shift c++ API refactor and python binding (#8131) @isVoid
  • Add groupby::replace_nulls(replace_policy) api (#7118) @isVoid

🛠️ Improvements

  • Support Dask + Distributed 2021.05.1 (#8392) @jakirkham
  • Add aliases for string methods (#8353) @shwina
  • Update environment variable used to determine cuda_version (#8321) @ajschmidt8
  • JNI: Refactor the code of making column from scalar (#8310) @firestarman
  • Update CHANGELOG.md links for calver (#8303) @ajschmidt8
  • Merge branch-0.19 into branch-21.06 (#8302) @ajschmidt8
  • use address and length for GDS reads/writes (#8301) @rongou
  • Update cudfjni version to 21.06.0 (#8292) @pxLi
  • Update docs build script (#8284) @ajschmidt8
  • Make device_buffer streams explicit and enforce move construction (#8280) @harrism
  • Introduce a common parent class for NumericalColumn and DecimalColumn (#8278) @vyasr
  • Do not add nulls to the hash table when null_equality::NOT_EQUAL is passed to left_semi_join and left_anti_join (#8277) @nvdbaranec
  • Enable implicit casting when concatenating mixed types (#8276) @ChrisJar
  • Fix CMake FindPackage rmm, pin dev envs' dlpack to v0.3 (#8271) @trxcllnt
  • Update cudfjni version to 21.06 (#8267) @pxLi
  • support RMM aligned resource adapter in JNI (#8266) @rongou
  • Pass compiler environment variables to conda python build (#8260) @Ethyling
  • Remove abc inheritance from Serializable (#8254) @vyasr
  • Move more methods into SingleColumnFrame (#8253) @vyasr
  • Update ORC statistics API to use C++17 standard library (#8241) @vuule
  • Correct unused parameter warnings in dictionary algorithms (#8239) @robertmaynard
  • Correct unused parameters in the copying algorithms (#8232) @robertmaynard
  • IO statistics cleanup (#8191) @kaatish
  • Refactor of rolling_window implementation. (#8158) @nvdbaranec
  • Add a flag for allowing single quotes in JSON strings. (#8144) @nvdbaranec
  • Column refactoring 2 (#8130) @vyasr
  • support space in workspace (#7956) @jolorunyomi
  • Support collect_set on rolling window (#7881) @sperlingxx

- C++
Published by GPUtester over 4 years ago

https://github.com/rapidsai/cudf - v0.19.2

🚨 Breaking Changes

  • Allow hash_partition to take a seed value (#7771) @magnatelee
  • Allow merging index column with data column using keyword "on" (#7736) @skirui-source
  • Change JNI API to avoid loading native dependencies when creating sort order classes. (#7729) @revans2
  • Replace device_vector with device_uvector in null_mask (#7715) @harrism
  • Don't identify decimals as strings. (#7710) @vyasr
  • Fix Java Parquet write after writer API changes (#7655) @revans2
  • Convert cudf::concatenate APIs to use spans and device_uvector (#7621) @harrism
  • Update missing docstring examples in python public APIs (#7546) @galipremsagar
  • Remove unneeded step parameter from strings::detail::copy_slice (#7525) @davidwendt
  • Rename ARROW_STATIC_LIB because it conflicts with one in FindArrow.cmake (#7518) @trxcllnt
  • Match Pandas logic for comparing two objects with nulls (#7490) @brandon-b-miller
  • Add struct support to parquet writer (#7461) @devavret
  • Join APIs that return gathermaps (#7454) @shwina
  • fixed_point + cudf::binary_operation API Changes (#7435) @codereport
  • Fix BUG: Exception when PYTHONOPTIMIZE=2 (#7434) @skirui-source
  • Change nvtext::load_vocabulary_file to return a unique ptr (#7424) @davidwendt
  • Refactor strings column factories (#7397) @harrism
  • Use CMAKE_CUDA_ARCHITECTURES (#7391) @robertmaynard
  • Upgrade pandas to 1.2 (#7375) @galipremsagar
  • Rename logical_cast to bit_cast and allow additional conversions (#7373) @ttnghia
  • Rework libcudf CMakeLists.txt to export targets for CPM (#7107) @trxcllnt

🐛 Bug Fixes

  • unsnap: busy wait a number of cycles (#8073) @vuule
  • Fix returned column type when extracting from an empty list column (#8031) @jlowe
  • Don't reindex a new value on setitem if the original dataframe was empty (#8026) @vyasr
  • Fix a NameError in meta dispatch API (#7996) @galipremsagar
  • Reindex in DataFrame.__setitem__ (#7957) @galipremsagar
  • jitify direct-to-cubin compilation and caching. (#7919) @cwharris
  • Use dynamic cudart for nvcomp in java build (#7896) @abellina
  • fix "incompatible redefinition" warnings (#7894) @cwharris
  • cudf consistently specifies the cuda runtime (#7887) @robertmaynard
  • disable verbose output for jitify_preprocess (#7886) @cwharris
  • CMake jit_preprocess_files function only runs when needed (#7872) @robertmaynard
  • Push DeviceScalar construction into cython for list.contains (#7864) @brandon-b-miller
  • cudf now sets an install rpath of $ORIGIN (#7863) @robertmaynard
  • Don't install Thrust examples, tests, docs, and python files (#7811) @robertmaynard
  • Sort by index in groupby tests more consistently (#7802) @shwina
  • Revert "Update conda recipes pinning of repo dependencies (#7743)" (#7793) @raydouglass
  • Add decimal column handling in copy_type_metadata (#7788) @shwina
  • Add column names validation in parquet writer (#7786) @galipremsagar
  • Fix Java explode outer unit tests (#7782) @jlowe
  • Fix compiler warning about non-POD types passed through ellipsis (#7781) @jrhemstad
  • User resource fix for replace_nulls (#7769) @magnatelee
  • Fix type dispatch for columnar replace_nulls (#7768) @jlowe
  • Add ignore_order parameter to dask-cudf concat dispatch (#7765) @galipremsagar
  • Fix slicing and arrow representations of decimal columns (#7755) @vyasr
  • Fixing issue with explode_outer position not nulling position entries of null rows (#7754) @hyperbolic2346
  • Implement scatter for struct columns (#7752) @ttnghia
  • Fix data corruption in string columns (#7746) @galipremsagar
  • Fix string length in stripe dictionary building (#7744) @kaatish
  • Update conda recipes pinning of repo dependencies (#7743) @mike-wendt
  • Enable dask dispatch to cuDF's is_categorical_dtype for cuDF objects (#7740) @brandon-b-miller
  • Fix dictionary size computation in ORC writer (#7737) @vuule
  • Fix cudf::cast overflow for decimal64 to int32_t or smaller in certain cases (#7733) @codereport
  • Change JNI API to avoid loading native dependencies when creating sort order classes. (#7729) @revans2
  • Disable column_view data accessors for unsupported types (#7725) @jrhemstad
  • Materialize RangeIndex when index=True in parquet writer (#7711) @galipremsagar
  • Don't identify decimals as strings. (#7710) @vyasr
  • Fix return type of DataFrame.argsort (#7706) @galipremsagar
  • Fix/correct cudf installed package requirements (#7688) @robertmaynard
  • Fix SparkMurmurHash3_32 hash inconsistencies with Apache Spark (#7672) @jlowe
  • Fix ORC reader issue with reading empty string columns (#7656) @rgsl888prabhu
  • Fix Java Parquet write after writer API changes (#7655) @revans2
  • Fixing empty null lists throwing explode_outer for a loop. (#7649) @hyperbolic2346
  • Fix internal compiler error during JNI Docker build (#7645) @jlowe
  • Fix Debug build break with device_uvectors in grouped_rolling.cu (#7633) @mythrocks
  • Parquet reader: Fix issue when using skip_rows on non-nested columns containing nulls (#7627) @nvdbaranec
  • Fix ORC reader for empty DataFrame/Table (#7624) @rgsl888prabhu
  • Fix specifying GPU architecture in JNI build (#7612) @jlowe
  • Fix ORC writer OOM issue (#7605) @vuule
  • Fix 0.18 --> 0.19 automerge (#7589) @kkraus14
  • Fix ORC issue with incorrect timestamp nanosecond values (#7581) @vuule
  • Fix missing Dask imports (#7580) @kkraus14
  • CMAKE_CUDA_ARCHITECTURES doesn't change when build-system invokes cmake (#7579) @robertmaynard
  • Another fix for offsets_end() iterator in lists_column_view (#7575) @ttnghia
  • Fix ORC writer output corruption with string columns (#7565) @vuule
  • Fix cudf::lists::sort_lists failing for sliced column (#7564) @ttnghia
  • FIX Fix Anaconda upload args (#7558) @dillon-cullinan
  • Fix index mismatch issue in equality related APIs (#7555) @galipremsagar
  • FIX Revert gpuci_conda_retry on conda file output locations (#7552) @dillon-cullinan
  • Fix offset_end iterator for lists_column_view, which was not correctl… (#7551) @ttnghia
  • Fix no such file dlpack.h error when build libcudf (#7549) @chenrui17
  • Update missing docstring examples in python public APIs (#7546) @galipremsagar
  • Decimal32 Build Fix (#7544) @razajafri
  • FIX Retry conda output location (#7540) @dillon-cullinan
  • fix missing renames of dask git branches from master to main (#7535) @kkraus14
  • Remove detail from device_span (#7533) @rwlee
  • Change dask and distributed branch to main (#7532) @dantegd
  • Update JNI build to use CUDF_USE_ARROW_STATIC (#7526) @jlowe
  • Make sure rmm::rmm CMake target is visible to cudf users (#7524) @robertmaynard
  • Fix contiguous_split not properly handling output partitions > 2 GB. (#7515) @nvdbaranec
  • Change jit launch to safe_launch (#7510) @devavret
  • Fix comparison between Datetime/Timedelta columns and NULL scalars (#7504) @brandon-b-miller
  • Fix off-by-one error in char-parallel string scalar replace (#7502) @jlowe
  • Fix JNI deprecation of all, put it on the wrong version before (#7501) @revans2
  • Fix Series/Dataframe Mixed Arithmetic (#7491) @brandon-b-miller
  • Fix JNI build after removal of libcudf sub-libraries (#7486) @jlowe
  • Correctly compile benchmarks (#7485) @robertmaynard
  • Fix bool column corruption with ORC Reader (#7483) @rgsl888prabhu
  • Fix __repr__ for categorical dtype (#7476) @galipremsagar
  • Java cleaner synchronization (#7474) @abellina
  • Fix java float/double parsing tests (#7473) @revans2
  • Pass stream and user resource to make_default_constructed_scalar (#7469) @magnatelee
  • Improve stability of dask_cudf.DataFrame.var and dask_cudf.DataFrame.std (#7453) @rjzamora
  • Missing device_storage_dispatch change affecting cudf::gather (#7449) @codereport
  • fix cuFile JNI compile errors (#7445) @rongou
  • Support Series.__setitem__ with key to a new row (#7443) @isVoid
  • Fix BUG: Exception when PYTHONOPTIMIZE=2 (#7434) @skirui-source
  • Make inclusive scan safe for cases with leading nulls (#7432) @magnatelee
  • Fix typo in list_device_view::pair_rep_end() (#7423) @mythrocks
  • Fix string to double conversion and row equivalent comparison (#7410) @ttnghia
  • Fix thrust failure when transferring data from device_vector to host_vector with vectors of size 1 (#7382) @ttnghia
  • Fix std::exception catch-by-reference gcc9 compile error (#7380) @davidwendt
  • Fix skiprows issue with ORC Reader (#7359) @rgsl888prabhu
  • fix Arrow CMake file (#7358) @rongou
  • Fix lists::contains() for NaN and Decimals (#7349) @mythrocks
  • Handle cupy array in Dataframe.__setitem__ (#7340) @galipremsagar
  • Fix invalid-device-fn error in cudf::strings::replace_re with multiple regex's (#7336) @davidwendt
  • FIX Add codecov upload block to gpu script (#6860) @dillon-cullinan

📖 Documentation

  • Fix join API doxygen (#7890) @shwina
  • Add Resources to README. (#7697) @bdice
  • Add isin examples in Docstring (#7479) @galipremsagar
  • Resolving unlinked type shorthands in cudf doc (#7416) @isVoid
  • Fix typo in regex.md doc page (#7363) @davidwendt
  • Fix incorrect strings_column_view::chars_size documentation (#7360) @jlowe

🚀 New Features

  • Enable basic reductions for decimal columns (#7776) @ChrisJar
  • Enable join on decimal columns (#7764) @ChrisJar
  • Allow merging index column with data column using keyword "on" (#7736) @skirui-source
  • Implement DecimalColumn + Scalar and add cudf.Scalars of Decimal64Dtype (#7732) @brandon-b-miller
  • Add support for unique groupby aggregation (#7726) @shwina
  • Expose libcudf's label_bins function to cudf (#7724) @vyasr
  • Adding support for equi-join on struct (#7720) @hyperbolic2346
  • Add decimal column comparison operations (#7716) @isVoid
  • Implement scan operations for decimal columns (#7707) @ChrisJar
  • Enable typecasting between decimal and int (#7691) @ChrisJar
  • Enable decimal support in parquet writer (#7673) @devavret
  • Adds list.unique API (#7664) @isVoid
  • Fix NaN handling in drop_list_duplicates (#7662) @ttnghia
  • Add lists.sort_values API (#7657) @isVoid
  • Add is_integer API that can check for the validity of a string-to-integer conversion (#7642) @ttnghia
  • Adds explode API (#7607) @isVoid
  • Adds list.take, python binding for cudf::lists::segmented_gather (#7591) @isVoid
  • Implement cudf::label_bins() (#7554) @vyasr
  • Add Python bindings for lists::contains (#7547) @skirui-source
  • cudf::row_bit_count() support. (#7534) @nvdbaranec
  • Implement drop_list_duplicates (#7528) @ttnghia
  • Add Python bindings for lists::extract_lists_element (#7505) @skirui-source
  • Add explode_outer and explode_outer_position (#7499) @hyperbolic2346
  • Match Pandas logic for comparing two objects with nulls (#7490) @brandon-b-miller
  • Add struct support to parquet writer (#7461) @devavret
  • Enable type conversion from float to decimal type (#7450) @ChrisJar
  • Add cython for converting strings/fixed-point functions (#7429) @davidwendt
  • Add struct column support to cudf::sort and cudf::sorted_order (#7422) @karthikeyann
  • Implement groupby collect_set (#7420) @ttnghia
  • Merge branch-0.18 into branch-0.19 (#7411) @raydouglass
  • Refactor strings column factories (#7397) @harrism
  • Add groupby scan operations (sort groupby) (#7387) @karthikeyann
  • Add cudf::explode_position (#7376) @hyperbolic2346
  • Add string conversion to/from decimal values libcudf APIs (#7364) @davidwendt
  • Add groupby SUM_OF_SQUARES support (#7362) @karthikeyann
  • Add Series.drop api (#7304) @isVoid
  • get_json_object() implementation (#7286) @nvdbaranec
  • Python API for ListMethods.len() (#7283) @isVoid
  • Support null_policy::EXCLUDE for COLLECT rolling aggregation (#7264) @mythrocks
  • Add support for special tokens in nvtext::subword_tokenizer (#7254) @davidwendt
  • Fix inplace update of data and add Series.update (#7201) @galipremsagar
  • Implement cudf::group_by (hash) for decimal32 and decimal64 (#7190) @codereport
  • Adding support to specify "level" parameter for Dataframe.rename (#7135) @skirui-source

🛠️ Improvements

  • fix GDS include path for version 0.95 (#7877) @rongou
  • Update dask + distributed to 2021.4.0 (#7858) @jakirkham
  • Add ability to extract include dirs from CUDF_HOME (#7848) @galipremsagar
  • Add USE_GDS as an option in build script (#7833) @pxLi
  • add an allocate method with stream in java DeviceMemoryBuffer (#7826) @rongou
  • Constrain dask and distributed versions to 2021.3.1 (#7825) @shwina
  • Revert dask versioning of concat dispatch (#7823) @galipremsagar
  • add copy methods in Java memory buffer (#7791) @rongou
  • Update README and CONTRIBUTING for 0.19 (#7778) @robertmaynard
  • Allow hash_partition to take a seed value (#7771) @magnatelee
  • Turn on NVTX by default in java build (#7761) @tgravescs
  • Add Java bindings to join gather map APIs (#7751) @jlowe
  • Add replacements column support for Java replaceNulls (#7750) @jlowe
  • Add Java bindings for row_bit_count (#7749) @jlowe
  • Remove unused JVM array creation (#7748) @jlowe
  • Added JNI support for new is_integer (#7739) @revans2
  • Create and promote library aliases in libcudf installations (#7734) @trxcllnt
  • Support groupby operations for decimal dtypes (#7731) @vyasr
  • Memory map the input file only when GDS compatibility mode is not used (#7717) @vuule
  • Replace device_vector with device_uvector in null_mask (#7715) @harrism
  • Struct hashing support for SerialMurmur3 and SparkMurmur3 (#7714) @jlowe
  • Add gbenchmark for nvtext replace-tokens function (#7708) @davidwendt
  • Use stream in groupby calls (#7705) @karthikeyann
  • Update codeowners file (#7701) @ajschmidt8
  • Cleanup groupby to use host_span, device_span, device_uvector (#7698) @karthikeyann
  • Add gbenchmark for nvtext ngrams functions (#7693) @davidwendt
  • Misc Python/Cython optimizations (#7686) @shwina
  • Add gbenchmark for nvtext tokenize functions (#7684) @davidwendt
  • Add column_device_view to orc writer (#7676) @kaatish
  • cudf_kafka now uses cuDF CMake export targets (CPM) (#7674) @robertmaynard
  • Add gbenchmark for nvtext normalize functions (#7668) @davidwendt
  • Resolve unnecessary import of thrust/optional.hpp in types.hpp (#7667) @vyasr
  • Feature/optimize accessor copy (#7660) @vyasr
  • Fix find_package(cudf) (#7658) @trxcllnt
  • Work-around for gcc7 compile error on Centos7 (#7652) @davidwendt
  • Add in JNI support for count_elements (#7651) @revans2
  • Fix issues with building cudf in a non-conda environment (#7647) @galipremsagar
  • Refactor ConfigureCUDA to not conditionally insert compiler flags (#7643) @robertmaynard
  • Add gbenchmark for converting strings to/from timestamps (#7641) @davidwendt
  • Handle constructing a cudf.Scalar from a cudf.Scalar (#7639) @shwina
  • Add in JNI support for table partition (#7637) @revans2
  • Add explicit fixed_point merge test (#7635) @codereport
  • Add JNI support for IDENTITY hash partitioning (#7626) @revans2
  • Java support on explode_outer (#7625) @sperlingxx
  • Java support of casting string from/to decimal (#7623) @sperlingxx
  • Convert cudf::concatenate APIs to use spans and device_uvector (#7621) @harrism
  • Add gbenchmark for cudf::strings::translate function (#7617) @davidwendt
  • Use file(COPY ) over file(INSTALL ) so cmake output is reduced (#7616) @robertmaynard
  • Use rmm::device_uvector in place of rmm::device_vector for ORC reader/writer and cudf::io::column_buffer (#7614) @vuule
  • Refactor Java host-side buffer concatenation to expose separate steps (#7610) @jlowe
  • Add gbenchmarks for string substrings functions (#7603) @davidwendt
  • Refactor string conversion check (#7599) @ttnghia
  • JNI: Pass names of children struct columns to native Arrow IPC writer (#7598) @firestarman
  • Revert "ENH Fix stale GHA and prevent duplicates " (#7595) @mike-wendt
  • ENH Fix stale GHA and prevent duplicates (#7594) @mike-wendt
  • Fix auto-detecting GPU architectures (#7593) @trxcllnt
  • Reduce cudf library size (#7583) @robertmaynard
  • Optimize cudf::make_strings_column for long strings (#7576) @davidwendt
  • Always build and export the cudf::cudftestutil target (#7574) @trxcllnt
  • Eliminate literal parameters to uvector::set_element_async and device_scalar::set_value (#7563) @harrism
  • Add gbenchmark for strings::concatenate (#7560) @davidwendt
  • Update Changelog Link (#7550) @ajschmidt8
  • Add gbenchmarks for strings replace regex functions (#7541) @davidwendt
  • Add __repr__ for Column and ColumnAccessor (#7531) @shwina
  • Support Decimal DIV changes in cudf (#7527) @razajafri
  • Remove unneeded step parameter from strings::detail::copy_slice (#7525) @davidwendt
  • Use device_uvector, device_span in sort groupby (#7523) @karthikeyann
  • Add gbenchmarks for strings extract function (#7522) @davidwendt
  • Rename ARROW_STATIC_LIB because it conflicts with one in FindArrow.cmake (#7518) @trxcllnt
  • Reduce compile time/size for scan.cu (#7516) @davidwendt
  • Change device_vector to device_uvector in nvtext source files (#7512) @davidwendt
  • Removed unneeded includes from traits.hpp (#7509) @davidwendt
  • FIX Remove random build directory generation for ccache (#7508) @dillon-cullinan
  • xfail failing pytest in pandas 1.2.3 (#7507) @galipremsagar
  • JNI bit cast (#7493) @revans2
  • Combine rolling window function tests (#7480) @mythrocks
  • Prepare Changelog for Automation (#7477) @ajschmidt8
  • Java support for explode position (#7471) @sperlingxx
  • Update 0.18 changelog entry (#7463) @ajschmidt8
  • JNI: Support skipping nulls for collect aggregation (#7457) @firestarman
  • Join APIs that return gathermaps (#7454) @shwina
  • Remove dependence on managed memory for multimap test (#7451) @jrhemstad
  • Use cuFile for Parquet IO when available (#7444) @vuule
  • Statistics cleanup (#7439) @kaatish
  • Add gbenchmarks for strings filter functions (#7438) @davidwendt
  • fixed_point + cudf::binary_operation API Changes (#7435) @codereport
  • Improve string gather performance (#7433) @jlowe
  • Don't use user resource for a temporary allocation in sort_by_key (#7431) @magnatelee
  • Detail APIs for datetime functions (#7430) @magnatelee
  • Replace thrust::max_element with thrust::reduce in strings findall_re (#7428) @davidwendt
  • Add gbenchmark for strings split/split_record functions (#7427) @davidwendt
  • Update JNI build to use CMAKE_CUDA_ARCHITECTURES (#7425) @jlowe
  • Change nvtext::load_vocabulary_file to return a unique ptr (#7424) @davidwendt
  • Simplify type dispatch with device_storage_dispatch (#7419) @codereport
  • Java support for casting of nested child columns (#7417) @razajafri
  • Improve scalar string replace performance for long strings (#7415) @jlowe
  • Remove unneeded temporary device vector for strings scatter specialization (#7409) @davidwendt
  • bitmask_or implementation with bitmask refactor (#7406) @rwlee
  • Add other cudf::strings::replace functions to current strings replace gbenchmark (#7403) @davidwendt
  • Clean up included headers in device_operators.cuh (#7401) @codereport
  • Move nullable index iterator to indexalator factory (#7399) @davidwendt
  • ENH Pass ccache variables to conda recipe & use Ninja in CI (#7398) @Ethyling
  • upgrade maven-antrun-plugin to support maven parallel builds (#7393) @rongou
  • Add gbenchmark for strings find/contains functions (#7392) @davidwendt
  • Use CMAKE_CUDA_ARCHITECTURES (#7391) @robertmaynard
  • Refactor libcudf strings::replace to use make_strings_children utility (#7384) @davidwendt
  • Added in JNI support for out of core sort algorithm (#7381) @revans2
  • Upgrade pandas to 1.2 (#7375) @galipremsagar
  • Rename logical_cast to bit_cast and allow additional conversions (#7373) @ttnghia
  • jitify 2 support (#7372) @cwharris
  • compile_udf: Cache PTX for similar functions (#7371) @gmarkall
  • Add string scalar replace benchmark (#7369) @jlowe
  • Add gbenchmark for strings contains_re/count_re functions (#7366) @davidwendt
  • Update orc reader and writer fuzz tests (#7357) @galipremsagar
  • Improve url_decode performance for long strings (#7353) @jlowe
  • cudf::ast Small Refactorings (#7352) @codereport
  • Remove std::cout and print in the scatter test function EmptyListsOfNullableStrings. (#7342) @ttnghia
  • Use cudf::detail::make_counting_transform_iterator (#7338) @codereport
  • Change block size parameter from a global to a template param. (#7333) @nvdbaranec
  • Partial clean up of ORC writer (#7324) @vuule
  • Add gbenchmark for cudf::strings::to_lower (#7316) @davidwendt
  • Update Java bindings version to 0.19-SNAPSHOT (#7307) @pxLi
  • Move cudf::test::make_counting_transform_iterator to cudf/detail/iterator.cuh (#7306) @codereport
  • Use string literals in fixed_point release_asserts (#7303) @codereport
  • Fix merge conflicts for #7295 (#7297) @ajschmidt8
  • Add UTF-8 chars to create_random_column<string_view> benchmark utility (#7292) @davidwendt
  • Abstracting block reduce and block scan from cuIO kernels with cub apis (#7278) @rgsl888prabhu
  • Build.sh use cmake --build to drive build system invocation (#7270) @robertmaynard
  • Refactor dictionary support for reductions any/all (#7242) @davidwendt
  • Replace stream.value() with stream for stream_view args (#7236) @karthikeyann
  • Interval index and interval_range (#7182) @marlenezw
  • avro reader integration tests (#7156) @cwharris
  • Rework libcudf CMakeLists.txt to export targets for CPM (#7107) @trxcllnt
  • Adding Interval Dtype (#6984) @marlenezw
  • Cleaning up for loops with make_(counting_)transform_iterator (#6546) @codereport

- C++
Published by GPUtester almost 5 years ago

https://github.com/rapidsai/cudf - v0.19.1

🚨 Breaking Changes

  • Allow hash_partition to take a seed value (#7771) @magnatelee
  • Allow merging index column with data column using keyword "on" (#7736) @skirui-source
  • Change JNI API to avoid loading native dependencies when creating sort order classes. (#7729) @revans2
  • Replace device_vector with device_uvector in null_mask (#7715) @harrism
  • Don't identify decimals as strings. (#7710) @vyasr
  • Fix Java Parquet write after writer API changes (#7655) @revans2
  • Convert cudf::concatenate APIs to use spans and device_uvector (#7621) @harrism
  • Update missing docstring examples in python public APIs (#7546) @galipremsagar
  • Remove unneeded step parameter from strings::detail::copy_slice (#7525) @davidwendt
  • Rename ARROW_STATIC_LIB because it conflicts with one in FindArrow.cmake (#7518) @trxcllnt
  • Match Pandas logic for comparing two objects with nulls (#7490) @brandon-b-miller
  • Add struct support to parquet writer (#7461) @devavret
  • Join APIs that return gathermaps (#7454) @shwina
  • fixed_point + cudf::binary_operation API Changes (#7435) @codereport
  • Fix BUG: Exception when PYTHONOPTIMIZE=2 (#7434) @skirui-source
  • Change nvtext::load_vocabulary_file to return a unique ptr (#7424) @davidwendt
  • Refactor strings column factories (#7397) @harrism
  • Use CMAKE_CUDA_ARCHITECTURES (#7391) @robertmaynard
  • Upgrade pandas to 1.2 (#7375) @galipremsagar
  • Rename logical_cast to bit_cast and allow additional conversions (#7373) @ttnghia
  • Rework libcudf CMakeLists.txt to export targets for CPM (#7107) @trxcllnt

🐛 Bug Fixes

  • Fix returned column type when extracting from an empty list column (#8031) @jlowe
  • Don't reindex a new value on setitem if the original dataframe was empty (#8026) @vyasr
  • Fix a NameError in meta dispatch API (#7996) @galipremsagar
  • Reindex in DataFrame.__setitem__ (#7957) @galipremsagar
  • jitify direct-to-cubin compilation and caching. (#7919) @cwharris
  • Use dynamic cudart for nvcomp in java build (#7896) @abellina
  • fix "incompatible redefinition" warnings (#7894) @cwharris
  • cudf consistently specifies the cuda runtime (#7887) @robertmaynard
  • disable verbose output for jitify_preprocess (#7886) @cwharris
  • CMake jit_preprocess_files function only runs when needed (#7872) @robertmaynard
  • Push DeviceScalar construction into cython for list.contains (#7864) @brandon-b-miller
  • cudf now sets an install rpath of $ORIGIN (#7863) @robertmaynard
  • Don't install Thrust examples, tests, docs, and python files (#7811) @robertmaynard
  • Sort by index in groupby tests more consistently (#7802) @shwina
  • Revert "Update conda recipes pinning of repo dependencies (#7743)" (#7793) @raydouglass
  • Add decimal column handling in copy_type_metadata (#7788) @shwina
  • Add column names validation in parquet writer (#7786) @galipremsagar
  • Fix Java explode outer unit tests (#7782) @jlowe
  • Fix compiler warning about non-POD types passed through ellipsis (#7781) @jrhemstad
  • User resource fix for replace_nulls (#7769) @magnatelee
  • Fix type dispatch for columnar replace_nulls (#7768) @jlowe
  • Add ignore_order parameter to dask-cudf concat dispatch (#7765) @galipremsagar
  • Fix slicing and arrow representations of decimal columns (#7755) @vyasr
  • Fixing issue with explode_outer position not nulling position entries of null rows (#7754) @hyperbolic2346
  • Implement scatter for struct columns (#7752) @ttnghia
  • Fix data corruption in string columns (#7746) @galipremsagar
  • Fix string length in stripe dictionary building (#7744) @kaatish
  • Update conda recipes pinning of repo dependencies (#7743) @mike-wendt
  • Enable dask dispatch to cuDF's is_categorical_dtype for cuDF objects (#7740) @brandon-b-miller
  • Fix dictionary size computation in ORC writer (#7737) @vuule
  • Fix cudf::cast overflow for decimal64 to int32_t or smaller in certain cases (#7733) @codereport
  • Change JNI API to avoid loading native dependencies when creating sort order classes. (#7729) @revans2
  • Disable column_view data accessors for unsupported types (#7725) @jrhemstad
  • Materialize RangeIndex when index=True in parquet writer (#7711) @galipremsagar
  • Don't identify decimals as strings. (#7710) @vyasr
  • Fix return type of DataFrame.argsort (#7706) @galipremsagar
  • Fix/correct cudf installed package requirements (#7688) @robertmaynard
  • Fix SparkMurmurHash3_32 hash inconsistencies with Apache Spark (#7672) @jlowe
  • Fix ORC reader issue with reading empty string columns (#7656) @rgsl888prabhu
  • Fix Java Parquet write after writer API changes (#7655) @revans2
  • Fixing empty null lists throwing explode_outer for a loop. (#7649) @hyperbolic2346
  • Fix internal compiler error during JNI Docker build (#7645) @jlowe
  • Fix Debug build break with device_uvectors in grouped_rolling.cu (#7633) @mythrocks
  • Parquet reader: Fix issue when using skip_rows on non-nested columns containing nulls (#7627) @nvdbaranec
  • Fix ORC reader for empty DataFrame/Table (#7624) @rgsl888prabhu
  • Fix specifying GPU architecture in JNI build (#7612) @jlowe
  • Fix ORC writer OOM issue (#7605) @vuule
  • Fix 0.18 --> 0.19 automerge (#7589) @kkraus14
  • Fix ORC issue with incorrect timestamp nanosecond values (#7581) @vuule
  • Fix missing Dask imports (#7580) @kkraus14
  • CMAKE_CUDA_ARCHITECTURES doesn't change when build-system invokes cmake (#7579) @robertmaynard
  • Another fix for offsets_end() iterator in lists_column_view (#7575) @ttnghia
  • Fix ORC writer output corruption with string columns (#7565) @vuule
  • Fix cudf::lists::sort_lists failing for sliced column (#7564) @ttnghia
  • FIX Fix Anaconda upload args (#7558) @dillon-cullinan
  • Fix index mismatch issue in equality related APIs (#7555) @galipremsagar
  • FIX Revert gpuci_conda_retry on conda file output locations (#7552) @dillon-cullinan
  • Fix offset_end iterator for lists_column_view, which was not correctl… (#7551) @ttnghia
  • Fix no such file dlpack.h error when build libcudf (#7549) @chenrui17
  • Update missing docstring examples in python public APIs (#7546) @galipremsagar
  • Decimal32 Build Fix (#7544) @razajafri
  • FIX Retry conda output location (#7540) @dillon-cullinan
  • fix missing renames of dask git branches from master to main (#7535) @kkraus14
  • Remove detail from device_span (#7533) @rwlee
  • Change dask and distributed branch to main (#7532) @dantegd
  • Update JNI build to use CUDF_USE_ARROW_STATIC (#7526) @jlowe
  • Make sure rmm::rmm CMake target is visible to cudf users (#7524) @robertmaynard
  • Fix contiguous_split not properly handling output partitions > 2 GB. (#7515) @nvdbaranec
  • Change jit launch to safe_launch (#7510) @devavret
  • Fix comparison between Datetime/Timedelta columns and NULL scalars (#7504) @brandon-b-miller
  • Fix off-by-one error in char-parallel string scalar replace (#7502) @jlowe
  • Fix JNI deprecation of all, put it on the wrong version before (#7501) @revans2
  • Fix Series/Dataframe Mixed Arithmetic (#7491) @brandon-b-miller
  • Fix JNI build after removal of libcudf sub-libraries (#7486) @jlowe
  • Correctly compile benchmarks (#7485) @robertmaynard
  • Fix bool column corruption with ORC Reader (#7483) @rgsl888prabhu
  • Fix __repr__ for categorical dtype (#7476) @galipremsagar
  • Java cleaner synchronization (#7474) @abellina
  • Fix java float/double parsing tests (#7473) @revans2
  • Pass stream and user resource to make_default_constructed_scalar (#7469) @magnatelee
  • Improve stability of dask_cudf.DataFrame.var and dask_cudf.DataFrame.std (#7453) @rjzamora
  • Missing device_storage_dispatch change affecting cudf::gather (#7449) @codereport
  • fix cuFile JNI compile errors (#7445) @rongou
  • Support Series.__setitem__ with key to a new row (#7443) @isVoid
  • Fix BUG: Exception when PYTHONOPTIMIZE=2 (#7434) @skirui-source
  • Make inclusive scan safe for cases with leading nulls (#7432) @magnatelee
  • Fix typo in list_device_view::pair_rep_end() (#7423) @mythrocks
  • Fix string to double conversion and row equivalent comparison (#7410) @ttnghia
  • Fix thrust failure when transferring data from device_vector to host_vector with vectors of size 1 (#7382) @ttnghia
  • Fix std::exception catch-by-reference gcc9 compile error (#7380) @davidwendt
  • Fix skiprows issue with ORC Reader (#7359) @rgsl888prabhu
  • fix Arrow CMake file (#7358) @rongou
  • Fix lists::contains() for NaN and Decimals (#7349) @mythrocks
  • Handle cupy array in Dataframe.__setitem__ (#7340) @galipremsagar
  • Fix invalid-device-fn error in cudf::strings::replace_re with multiple regex's (#7336) @davidwendt
  • FIX Add codecov upload block to gpu script (#6860) @dillon-cullinan

📖 Documentation

  • Fix join API doxygen (#7890) @shwina
  • Add Resources to README. (#7697) @bdice
  • Add isin examples in Docstring (#7479) @galipremsagar
  • Resolving unlinked type shorthands in cudf doc (#7416) @isVoid
  • Fix typo in regex.md doc page (#7363) @davidwendt
  • Fix incorrect strings_column_view::chars_size documentation (#7360) @jlowe

🚀 New Features

  • Enable basic reductions for decimal columns (#7776) @ChrisJar
  • Enable join on decimal columns (#7764) @ChrisJar
  • Allow merging index column with data column using keyword "on" (#7736) @skirui-source
  • Implement DecimalColumn + Scalar and add cudf.Scalars of Decimal64Dtype (#7732) @brandon-b-miller
  • Add support for unique groupby aggregation (#7726) @shwina
  • Expose libcudf's label_bins function to cudf (#7724) @vyasr
  • Adding support for equi-join on struct (#7720) @hyperbolic2346
  • Add decimal column comparison operations (#7716) @isVoid
  • Implement scan operations for decimal columns (#7707) @ChrisJar
  • Enable typecasting between decimal and int (#7691) @ChrisJar
  • Enable decimal support in parquet writer (#7673) @devavret
  • Adds list.unique API (#7664) @isVoid
  • Fix NaN handling in drop_list_duplicates (#7662) @ttnghia
  • Add lists.sort_values API (#7657) @isVoid
  • Add is_integer API that can check for the validity of a string-to-integer conversion (#7642) @ttnghia
  • Adds explode API (#7607) @isVoid
  • Adds list.take, python binding for cudf::lists::segmented_gather (#7591) @isVoid
  • Implement cudf::label_bins() (#7554) @vyasr
  • Add Python bindings for lists::contains (#7547) @skirui-source
  • cudf::row_bit_count() support. (#7534) @nvdbaranec
  • Implement drop_list_duplicates (#7528) @ttnghia
  • Add Python bindings for lists::extract_lists_element (#7505) @skirui-source
  • Add explode_outer and explode_outer_position (#7499) @hyperbolic2346
  • Match Pandas logic for comparing two objects with nulls (#7490) @brandon-b-miller
  • Add struct support to parquet writer (#7461) @devavret
  • Enable type conversion from float to decimal type (#7450) @ChrisJar
  • Add cython for converting strings/fixed-point functions (#7429) @davidwendt
  • Add struct column support to cudf::sort and cudf::sorted_order (#7422) @karthikeyann
  • Implement groupby collect_set (#7420) @ttnghia
  • Merge branch-0.18 into branch-0.19 (#7411) @raydouglass
  • Refactor strings column factories (#7397) @harrism
  • Add groupby scan operations (sort groupby) (#7387) @karthikeyann
  • Add cudf::explode_position (#7376) @hyperbolic2346
  • Add string conversion to/from decimal values libcudf APIs (#7364) @davidwendt
  • Add groupby SUM_OF_SQUARES support (#7362) @karthikeyann
  • Add Series.drop api (#7304) @isVoid
  • get_json_object() implementation (#7286) @nvdbaranec
  • Python API for ListMethods.len() (#7283) @isVoid
  • Support null_policy::EXCLUDE for COLLECT rolling aggregation (#7264) @mythrocks
  • Add support for special tokens in nvtext::subword_tokenizer (#7254) @davidwendt
  • Fix inplace update of data and add Series.update (#7201) @galipremsagar
  • Implement cudf::group_by (hash) for decimal32 and decimal64 (#7190) @codereport
  • Adding support to specify "level" parameter for Dataframe.rename (#7135) @skirui-source

🛠️ Improvements

  • fix GDS include path for version 0.95 (#7877) @rongou
  • Update dask + distributed to 2021.4.0 (#7858) @jakirkham
  • Add ability to extract include dirs from CUDF_HOME (#7848) @galipremsagar
  • Add USE_GDS as an option in build script (#7833) @pxLi
  • add an allocate method with stream in java DeviceMemoryBuffer (#7826) @rongou
  • Constrain dask and distributed versions to 2021.3.1 (#7825) @shwina
  • Revert dask versioning of concat dispatch (#7823) @galipremsagar
  • add copy methods in Java memory buffer (#7791) @rongou
  • Update README and CONTRIBUTING for 0.19 (#7778) @robertmaynard
  • Allow hash_partition to take a seed value (#7771) @magnatelee
  • Turn on NVTX by default in java build (#7761) @tgravescs
  • Add Java bindings to join gather map APIs (#7751) @jlowe
  • Add replacements column support for Java replaceNulls (#7750) @jlowe
  • Add Java bindings for row_bit_count (#7749) @jlowe
  • Remove unused JVM array creation (#7748) @jlowe
  • Added JNI support for new is_integer (#7739) @revans2
  • Create and promote library aliases in libcudf installations (#7734) @trxcllnt
  • Support groupby operations for decimal dtypes (#7731) @vyasr
  • Memory map the input file only when GDS compatibility mode is not used (#7717) @vuule
  • Replace device_vector with device_uvector in null_mask (#7715) @harrism
  • Struct hashing support for SerialMurmur3 and SparkMurmur3 (#7714) @jlowe
  • Add gbenchmark for nvtext replace-tokens function (#7708) @davidwendt
  • Use stream in groupby calls (#7705) @karthikeyann
  • Update codeowners file (#7701) @ajschmidt8
  • Cleanup groupby to use host_span, device_span, device_uvector (#7698) @karthikeyann
  • Add gbenchmark for nvtext ngrams functions (#7693) @davidwendt
  • Misc Python/Cython optimizations (#7686) @shwina
  • Add gbenchmark for nvtext tokenize functions (#7684) @davidwendt
  • Add column_device_view to orc writer (#7676) @kaatish
  • cudf_kafka now uses cuDF CMake export targets (CPM) (#7674) @robertmaynard
  • Add gbenchmark for nvtext normalize functions (#7668) @davidwendt
  • Resolve unnecessary import of thrust/optional.hpp in types.hpp (#7667) @vyasr
  • Feature/optimize accessor copy (#7660) @vyasr
  • Fix find_package(cudf) (#7658) @trxcllnt
  • Work-around for gcc7 compile error on Centos7 (#7652) @davidwendt
  • Add in JNI support for count_elements (#7651) @revans2
  • Fix issues with building cudf in a non-conda environment (#7647) @galipremsagar
  • Refactor ConfigureCUDA to not conditionally insert compiler flags (#7643) @robertmaynard
  • Add gbenchmark for converting strings to/from timestamps (#7641) @davidwendt
  • Handle constructing a cudf.Scalar from a cudf.Scalar (#7639) @shwina
  • Add in JNI support for table partition (#7637) @revans2
  • Add explicit fixed_point merge test (#7635) @codereport
  • Add JNI support for IDENTITY hash partitioning (#7626) @revans2
  • Java support on explode_outer (#7625) @sperlingxx
  • Java support of casting string from/to decimal (#7623) @sperlingxx
  • Convert cudf::concatenate APIs to use spans and device_uvector (#7621) @harrism
  • Add gbenchmark for cudf::strings::translate function (#7617) @davidwendt
  • Use file(COPY ) over file(INSTALL ) so cmake output is reduced (#7616) @robertmaynard
  • Use rmm::device_uvector in place of rmm::device_vector for ORC reader/writer and cudf::io::column_buffer (#7614) @vuule
  • Refactor Java host-side buffer concatenation to expose separate steps (#7610) @jlowe
  • Add gbenchmarks for string substrings functions (#7603) @davidwendt
  • Refactor string conversion check (#7599) @ttnghia
  • JNI: Pass names of children struct columns to native Arrow IPC writer (#7598) @firestarman
  • Revert "ENH Fix stale GHA and prevent duplicates " (#7595) @mike-wendt
  • ENH Fix stale GHA and prevent duplicates (#7594) @mike-wendt
  • Fix auto-detecting GPU architectures (#7593) @trxcllnt
  • Reduce cudf library size (#7583) @robertmaynard
  • Optimize cudf::make_strings_column for long strings (#7576) @davidwendt
  • Always build and export the cudf::cudftestutil target (#7574) @trxcllnt
  • Eliminate literal parameters to uvector::set_element_async and device_scalar::set_value (#7563) @harrism
  • Add gbenchmark for strings::concatenate (#7560) @davidwendt
  • Update Changelog Link (#7550) @ajschmidt8
  • Add gbenchmarks for strings replace regex functions (#7541) @davidwendt
  • Add __repr__ for Column and ColumnAccessor (#7531) @shwina
  • Support Decimal DIV changes in cudf (#7527) @razajafri
  • Remove unneeded step parameter from strings::detail::copy_slice (#7525) @davidwendt
  • Use device_uvector, device_span in sort groupby (#7523) @karthikeyann
  • Add gbenchmarks for strings extract function (#7522) @davidwendt
  • Rename ARROW_STATIC_LIB because it conflicts with one in FindArrow.cmake (#7518) @trxcllnt
  • Reduce compile time/size for scan.cu (#7516) @davidwendt
  • Change device_vector to device_uvector in nvtext source files (#7512) @davidwendt
  • Removed unneeded includes from traits.hpp (#7509) @davidwendt
  • FIX Remove random build directory generation for ccache (#7508) @dillon-cullinan
  • xfail failing pytest in pandas 1.2.3 (#7507) @galipremsagar
  • JNI bit cast (#7493) @revans2
  • Combine rolling window function tests (#7480) @mythrocks
  • Prepare Changelog for Automation (#7477) @ajschmidt8
  • Java support for explode position (#7471) @sperlingxx
  • Update 0.18 changelog entry (#7463) @ajschmidt8
  • JNI: Support skipping nulls for collect aggregation (#7457) @firestarman
  • Join APIs that return gathermaps (#7454) @shwina
  • Remove dependence on managed memory for multimap test (#7451) @jrhemstad
  • Use cuFile for Parquet IO when available (#7444) @vuule
  • Statistics cleanup (#7439) @kaatish
  • Add gbenchmarks for strings filter functions (#7438) @davidwendt
  • fixed_point + cudf::binary_operation API Changes (#7435) @codereport
  • Improve string gather performance (#7433) @jlowe
  • Don't use user resource for a temporary allocation in sort_by_key (#7431) @magnatelee
  • Detail APIs for datetime functions (#7430) @magnatelee
  • Replace thrust::max_element with thrust::reduce in strings findall_re (#7428) @davidwendt
  • Add gbenchmark for strings split/split_record functions (#7427) @davidwendt
  • Update JNI build to use CMAKE_CUDA_ARCHITECTURES (#7425) @jlowe
  • Change nvtext::load_vocabulary_file to return a unique ptr (#7424) @davidwendt
  • Simplify type dispatch with device_storage_dispatch (#7419) @codereport
  • Java support for casting of nested child columns (#7417) @razajafri
  • Improve scalar string replace performance for long strings (#7415) @jlowe
  • Remove unneeded temporary device vector for strings scatter specialization (#7409) @davidwendt
  • bitmask_or implementation with bitmask refactor (#7406) @rwlee
  • Add other cudf::strings::replace functions to current strings replace gbenchmark (#7403) @davidwendt
  • Clean up included headers in device_operators.cuh (#7401) @codereport
  • Move nullable index iterator to indexalator factory (#7399) @davidwendt
  • ENH Pass ccache variables to conda recipe & use Ninja in CI (#7398) @Ethyling
  • upgrade maven-antrun-plugin to support maven parallel builds (#7393) @rongou
  • Add gbenchmark for strings find/contains functions (#7392) @davidwendt
  • Use CMAKE_CUDA_ARCHITECTURES (#7391) @robertmaynard
  • Refactor libcudf strings::replace to use make_strings_children utility (#7384) @davidwendt
  • Added in JNI support for out of core sort algorithm (#7381) @revans2
  • Upgrade pandas to 1.2 (#7375) @galipremsagar
  • Rename logical_cast to bit_cast and allow additional conversions (#7373) @ttnghia
  • jitify 2 support (#7372) @cwharris
  • compile_udf: Cache PTX for similar functions (#7371) @gmarkall
  • Add string scalar replace benchmark (#7369) @jlowe
  • Add gbenchmark for strings contains_re/count_re functions (#7366) @davidwendt
  • Update orc reader and writer fuzz tests (#7357) @galipremsagar
  • Improve url_decode performance for long strings (#7353) @jlowe
  • cudf::ast Small Refactorings (#7352) @codereport
  • Remove std::cout and print in the scatter test function EmptyListsOfNullableStrings. (#7342) @ttnghia
  • Use cudf::detail::make_counting_transform_iterator (#7338) @codereport
  • Change block size parameter from a global to a template param. (#7333) @nvdbaranec
  • Partial clean up of ORC writer (#7324) @vuule
  • Add gbenchmark for cudf::strings::to_lower (#7316) @davidwendt
  • Update Java bindings version to 0.19-SNAPSHOT (#7307) @pxLi
  • Move cudf::test::make_counting_transform_iterator to cudf/detail/iterator.cuh (#7306) @codereport
  • Use string literals in fixed_point release_asserts (#7303) @codereport
  • Fix merge conflicts for #7295 (#7297) @ajschmidt8
  • Add UTF-8 chars to create_random_column<string_view> benchmark utility (#7292) @davidwendt
  • Abstracting block reduce and block scan from cuIO kernels with cub apis (#7278) @rgsl888prabhu
  • Build.sh use cmake --build to drive build system invocation (#7270) @robertmaynard
  • Refactor dictionary support for reductions any/all (#7242) @davidwendt
  • Replace stream.value() with stream for stream_view args (#7236) @karthikeyann
  • Interval index and interval_range (#7182) @marlenezw
  • avro reader integration tests (#7156) @cwharris
  • Rework libcudf CMakeLists.txt to export targets for CPM (#7107) @trxcllnt
  • Adding Interval Dtype (#6984) @marlenezw
  • Cleaning up for loops with make_(counting_)transform_iterator (#6546) @codereport

- C++
Published by GPUtester almost 5 years ago

https://github.com/rapidsai/cudf - v0.19.0

🚨 Breaking Changes

  • Allow hash_partition to take a seed value (#7771) @magnatelee
  • Allow merging index column with data column using keyword "on" (#7736) @skirui-source
  • Change JNI API to avoid loading native dependencies when creating sort order classes. (#7729) @revans2
  • Replace device_vector with device_uvector in null_mask (#7715) @harrism
  • Don't identify decimals as strings. (#7710) @vyasr
  • Fix Java Parquet write after writer API changes (#7655) @revans2
  • Convert cudf::concatenate APIs to use spans and device_uvector (#7621) @harrism
  • Update missing docstring examples in python public APIs (#7546) @galipremsagar
  • Remove unneeded step parameter from strings::detail::copy_slice (#7525) @davidwendt
  • Rename ARROW_STATIC_LIB because it conflicts with one in FindArrow.cmake (#7518) @trxcllnt
  • Match Pandas logic for comparing two objects with nulls (#7490) @brandon-b-miller
  • Add struct support to parquet writer (#7461) @devavret
  • Join APIs that return gathermaps (#7454) @shwina
  • fixed_point + cudf::binary_operation API Changes (#7435) @codereport
  • Fix BUG: Exception when PYTHONOPTIMIZE=2 (#7434) @skirui-source
  • Change nvtext::load_vocabulary_file to return a unique ptr (#7424) @davidwendt
  • Refactor strings column factories (#7397) @harrism
  • Use CMAKE_CUDA_ARCHITECTURES (#7391) @robertmaynard
  • Upgrade pandas to 1.2 (#7375) @galipremsagar
  • Rename logical_cast to bit_cast and allow additional conversions (#7373) @ttnghia
  • Rework libcudf CMakeLists.txt to export targets for CPM (#7107) @trxcllnt

🐛 Bug Fixes

  • Fix a NameError in meta dispatch API (#7996) @galipremsagar
  • Reindex in DataFrame.__setitem__ (#7957) @galipremsagar
  • jitify direct-to-cubin compilation and caching. (#7919) @cwharris
  • Use dynamic cudart for nvcomp in java build (#7896) @abellina
  • fix "incompatible redefinition" warnings (#7894) @cwharris
  • cudf consistently specifies the cuda runtime (#7887) @robertmaynard
  • disable verbose output for jitify_preprocess (#7886) @cwharris
  • CMake jit_preprocess_files function only runs when needed (#7872) @robertmaynard
  • Push DeviceScalar construction into cython for list.contains (#7864) @brandon-b-miller
  • cudf now sets an install rpath of $ORIGIN (#7863) @robertmaynard
  • Don't install Thrust examples, tests, docs, and python files (#7811) @robertmaynard
  • Sort by index in groupby tests more consistently (#7802) @shwina
  • Revert "Update conda recipes pinning of repo dependencies (#7743)" (#7793) @raydouglass
  • Add decimal column handling in copy_type_metadata (#7788) @shwina
  • Add column names validation in parquet writer (#7786) @galipremsagar
  • Fix Java explode outer unit tests (#7782) @jlowe
  • Fix compiler warning about non-POD types passed through ellipsis (#7781) @jrhemstad
  • User resource fix for replace_nulls (#7769) @magnatelee
  • Fix type dispatch for columnar replace_nulls (#7768) @jlowe
  • Add ignore_order parameter to dask-cudf concat dispatch (#7765) @galipremsagar
  • Fix slicing and arrow representations of decimal columns (#7755) @vyasr
  • Fixing issue with explode_outer position not nulling position entries of null rows (#7754) @hyperbolic2346
  • Implement scatter for struct columns (#7752) @ttnghia
  • Fix data corruption in string columns (#7746) @galipremsagar
  • Fix string length in stripe dictionary building (#7744) @kaatish
  • Update conda recipes pinning of repo dependencies (#7743) @mike-wendt
  • Enable dask dispatch to cuDF's is_categorical_dtype for cuDF objects (#7740) @brandon-b-miller
  • Fix dictionary size computation in ORC writer (#7737) @vuule
  • Fix cudf::cast overflow for decimal64 to int32_t or smaller in certain cases (#7733) @codereport
  • Change JNI API to avoid loading native dependencies when creating sort order classes. (#7729) @revans2
  • Disable column_view data accessors for unsupported types (#7725) @jrhemstad
  • Materialize RangeIndex when index=True in parquet writer (#7711) @galipremsagar
  • Don't identify decimals as strings. (#7710) @vyasr
  • Fix return type of DataFrame.argsort (#7706) @galipremsagar
  • Fix/correct cudf installed package requirements (#7688) @robertmaynard
  • Fix SparkMurmurHash3_32 hash inconsistencies with Apache Spark (#7672) @jlowe
  • Fix ORC reader issue with reading empty string columns (#7656) @rgsl888prabhu
  • Fix Java Parquet write after writer API changes (#7655) @revans2
  • Fixing empty null lists throwing explode_outer for a loop. (#7649) @hyperbolic2346
  • Fix internal compiler error during JNI Docker build (#7645) @jlowe
  • Fix Debug build break with device_uvectors in grouped_rolling.cu (#7633) @mythrocks
  • Parquet reader: Fix issue when using skip_rows on non-nested columns containing nulls (#7627) @nvdbaranec
  • Fix ORC reader for empty DataFrame/Table (#7624) @rgsl888prabhu
  • Fix specifying GPU architecture in JNI build (#7612) @jlowe
  • Fix ORC writer OOM issue (#7605) @vuule
  • Fix 0.18 --> 0.19 automerge (#7589) @kkraus14
  • Fix ORC issue with incorrect timestamp nanosecond values (#7581) @vuule
  • Fix missing Dask imports (#7580) @kkraus14
  • CMAKE_CUDA_ARCHITECTURES doesn't change when build-system invokes cmake (#7579) @robertmaynard
  • Another fix for offsets_end() iterator in lists_column_view (#7575) @ttnghia
  • Fix ORC writer output corruption with string columns (#7565) @vuule
  • Fix cudf::lists::sort_lists failing for sliced column (#7564) @ttnghia
  • FIX Fix Anaconda upload args (#7558) @dillon-cullinan
  • Fix index mismatch issue in equality related APIs (#7555) @galipremsagar
  • FIX Revert gpuci_conda_retry on conda file output locations (#7552) @dillon-cullinan
  • Fix offset_end iterator for lists_column_view, which was not correctl… (#7551) @ttnghia
  • Fix no such file dlpack.h error when build libcudf (#7549) @chenrui17
  • Update missing docstring examples in python public APIs (#7546) @galipremsagar
  • Decimal32 Build Fix (#7544) @razajafri
  • FIX Retry conda output location (#7540) @dillon-cullinan
  • fix missing renames of dask git branches from master to main (#7535) @kkraus14
  • Remove detail from device_span (#7533) @rwlee
  • Change dask and distributed branch to main (#7532) @dantegd
  • Update JNI build to use CUDF_USE_ARROW_STATIC (#7526) @jlowe
  • Make sure rmm::rmm CMake target is visible to cudf users (#7524) @robertmaynard
  • Fix contiguous_split not properly handling output partitions > 2 GB. (#7515) @nvdbaranec
  • Change jit launch to safe_launch (#7510) @devavret
  • Fix comparison between Datetime/Timedelta columns and NULL scalars (#7504) @brandon-b-miller
  • Fix off-by-one error in char-parallel string scalar replace (#7502) @jlowe
  • Fix JNI deprecation of all, put it on the wrong version before (#7501) @revans2
  • Fix Series/Dataframe Mixed Arithmetic (#7491) @brandon-b-miller
  • Fix JNI build after removal of libcudf sub-libraries (#7486) @jlowe
  • Correctly compile benchmarks (#7485) @robertmaynard
  • Fix bool column corruption with ORC Reader (#7483) @rgsl888prabhu
  • Fix __repr__ for categorical dtype (#7476) @galipremsagar
  • Java cleaner synchronization (#7474) @abellina
  • Fix java float/double parsing tests (#7473) @revans2
  • Pass stream and user resource to make_default_constructed_scalar (#7469) @magnatelee
  • Improve stability of dask_cudf.DataFrame.var and dask_cudf.DataFrame.std (#7453) @rjzamora
  • Missing device_storage_dispatch change affecting cudf::gather (#7449) @codereport
  • fix cuFile JNI compile errors (#7445) @rongou
  • Support Series.__setitem__ with key to a new row (#7443) @isVoid
  • Fix BUG: Exception when PYTHONOPTIMIZE=2 (#7434) @skirui-source
  • Make inclusive scan safe for cases with leading nulls (#7432) @magnatelee
  • Fix typo in list_device_view::pair_rep_end() (#7423) @mythrocks
  • Fix string to double conversion and row equivalent comparison (#7410) @ttnghia
  • Fix thrust failure when transferring data from device_vector to host_vector with vectors of size 1 (#7382) @ttnghia
  • Fix std::exception catch-by-reference gcc9 compile error (#7380) @davidwendt
  • Fix skiprows issue with ORC Reader (#7359) @rgsl888prabhu
  • fix Arrow CMake file (#7358) @rongou
  • Fix lists::contains() for NaN and Decimals (#7349) @mythrocks
  • Handle cupy array in Dataframe.__setitem__ (#7340) @galipremsagar
  • Fix invalid-device-fn error in cudf::strings::replace_re with multiple regex's (#7336) @davidwendt
  • FIX Add codecov upload block to gpu script (#6860) @dillon-cullinan

📖 Documentation

  • Fix join API doxygen (#7890) @shwina
  • Add Resources to README. (#7697) @bdice
  • Add isin examples in Docstring (#7479) @galipremsagar
  • Resolving unlinked type shorthands in cudf doc (#7416) @isVoid
  • Fix typo in regex.md doc page (#7363) @davidwendt
  • Fix incorrect strings_column_view::chars_size documentation (#7360) @jlowe

🚀 New Features

  • Enable basic reductions for decimal columns (#7776) @ChrisJar
  • Enable join on decimal columns (#7764) @ChrisJar
  • Allow merging index column with data column using keyword "on" (#7736) @skirui-source
  • Implement DecimalColumn + Scalar and add cudf.Scalars of Decimal64Dtype (#7732) @brandon-b-miller
  • Add support for unique groupby aggregation (#7726) @shwina
  • Expose libcudf's label_bins function to cudf (#7724) @vyasr
  • Adding support for equi-join on struct (#7720) @hyperbolic2346
  • Add decimal column comparison operations (#7716) @isVoid
  • Implement scan operations for decimal columns (#7707) @ChrisJar
  • Enable typecasting between decimal and int (#7691) @ChrisJar
  • Enable decimal support in parquet writer (#7673) @devavret
  • Adds list.unique API (#7664) @isVoid
  • Fix NaN handling in drop_list_duplicates (#7662) @ttnghia
  • Add lists.sort_values API (#7657) @isVoid
  • Add is_integer API that can check for the validity of a string-to-integer conversion (#7642) @ttnghia
  • Adds explode API (#7607) @isVoid
  • Adds list.take, python binding for cudf::lists::segmented_gather (#7591) @isVoid
  • Implement cudf::label_bins() (#7554) @vyasr
  • Add Python bindings for lists::contains (#7547) @skirui-source
  • cudf::row_bit_count() support. (#7534) @nvdbaranec
  • Implement drop_list_duplicates (#7528) @ttnghia
  • Add Python bindings for lists::extract_lists_element (#7505) @skirui-source
  • Add explode_outer and explode_outer_position (#7499) @hyperbolic2346
  • Match Pandas logic for comparing two objects with nulls (#7490) @brandon-b-miller
  • Add struct support to parquet writer (#7461) @devavret
  • Enable type conversion from float to decimal type (#7450) @ChrisJar
  • Add cython for converting strings/fixed-point functions (#7429) @davidwendt
  • Add struct column support to cudf::sort and cudf::sorted_order (#7422) @karthikeyann
  • Implement groupby collect_set (#7420) @ttnghia
  • Merge branch-0.18 into branch-0.19 (#7411) @raydouglass
  • Refactor strings column factories (#7397) @harrism
  • Add groupby scan operations (sort groupby) (#7387) @karthikeyann
  • Add cudf::explode_position (#7376) @hyperbolic2346
  • Add string conversion to/from decimal values libcudf APIs (#7364) @davidwendt
  • Add groupby SUM_OF_SQUARES support (#7362) @karthikeyann
  • Add Series.drop api (#7304) @isVoid
  • get_json_object() implementation (#7286) @nvdbaranec
  • Python API for ListMethods.len() (#7283) @isVoid
  • Support null_policy::EXCLUDE for COLLECT rolling aggregation (#7264) @mythrocks
  • Add support for special tokens in nvtext::subword_tokenizer (#7254) @davidwendt
  • Fix inplace update of data and add Series.update (#7201) @galipremsagar
  • Implement cudf::group_by (hash) for decimal32 and decimal64 (#7190) @codereport
  • Adding support to specify "level" parameter for Dataframe.rename (#7135) @skirui-source

🛠️ Improvements

  • fix GDS include path for version 0.95 (#7877) @rongou
  • Update dask + distributed to 2021.4.0 (#7858) @jakirkham
  • Add ability to extract include dirs from CUDF_HOME (#7848) @galipremsagar
  • Add USE_GDS as an option in build script (#7833) @pxLi
  • add an allocate method with stream in java DeviceMemoryBuffer (#7826) @rongou
  • Constrain dask and distributed versions to 2021.3.1 (#7825) @shwina
  • Revert dask versioning of concat dispatch (#7823) @galipremsagar
  • add copy methods in Java memory buffer (#7791) @rongou
  • Update README and CONTRIBUTING for 0.19 (#7778) @robertmaynard
  • Allow hash_partition to take a seed value (#7771) @magnatelee
  • Turn on NVTX by default in java build (#7761) @tgravescs
  • Add Java bindings to join gather map APIs (#7751) @jlowe
  • Add replacements column support for Java replaceNulls (#7750) @jlowe
  • Add Java bindings for row_bit_count (#7749) @jlowe
  • Remove unused JVM array creation (#7748) @jlowe
  • Added JNI support for new is_integer (#7739) @revans2
  • Create and promote library aliases in libcudf installations (#7734) @trxcllnt
  • Support groupby operations for decimal dtypes (#7731) @vyasr
  • Memory map the input file only when GDS compatibility mode is not used (#7717) @vuule
  • Replace device_vector with device_uvector in null_mask (#7715) @harrism
  • Struct hashing support for SerialMurmur3 and SparkMurmur3 (#7714) @jlowe
  • Add gbenchmark for nvtext replace-tokens function (#7708) @davidwendt
  • Use stream in groupby calls (#7705) @karthikeyann
  • Update codeowners file (#7701) @ajschmidt8
  • Cleanup groupby to use host_span, device_span, device_uvector (#7698) @karthikeyann
  • Add gbenchmark for nvtext ngrams functions (#7693) @davidwendt
  • Misc Python/Cython optimizations (#7686) @shwina
  • Add gbenchmark for nvtext tokenize functions (#7684) @davidwendt
  • Add column_device_view to orc writer (#7676) @kaatish
  • cudf_kafka now uses cuDF CMake export targets (CPM) (#7674) @robertmaynard
  • Add gbenchmark for nvtext normalize functions (#7668) @davidwendt
  • Resolve unnecessary import of thrust/optional.hpp in types.hpp (#7667) @vyasr
  • Feature/optimize accessor copy (#7660) @vyasr
  • Fix find_package(cudf) (#7658) @trxcllnt
  • Work-around for gcc7 compile error on Centos7 (#7652) @davidwendt
  • Add in JNI support for count_elements (#7651) @revans2
  • Fix issues with building cudf in a non-conda environment (#7647) @galipremsagar
  • Refactor ConfigureCUDA to not conditionally insert compiler flags (#7643) @robertmaynard
  • Add gbenchmark for converting strings to/from timestamps (#7641) @davidwendt
  • Handle constructing a cudf.Scalar from a cudf.Scalar (#7639) @shwina
  • Add in JNI support for table partition (#7637) @revans2
  • Add explicit fixed_point merge test (#7635) @codereport
  • Add JNI support for IDENTITY hash partitioning (#7626) @revans2
  • Java support on explode_outer (#7625) @sperlingxx
  • Java support of casting string from/to decimal (#7623) @sperlingxx
  • Convert cudf::concatenate APIs to use spans and device_uvector (#7621) @harrism
  • Add gbenchmark for cudf::strings::translate function (#7617) @davidwendt
  • Use file(COPY ) over file(INSTALL ) so cmake output is reduced (#7616) @robertmaynard
  • Use rmm::device_uvector in place of rmm::device_vector for ORC reader/writer and cudf::io::column_buffer (#7614) @vuule
  • Refactor Java host-side buffer concatenation to expose separate steps (#7610) @jlowe
  • Add gbenchmarks for string substrings functions (#7603) @davidwendt
  • Refactor string conversion check (#7599) @ttnghia
  • JNI: Pass names of children struct columns to native Arrow IPC writer (#7598) @firestarman
  • Revert "ENH Fix stale GHA and prevent duplicates " (#7595) @mike-wendt
  • ENH Fix stale GHA and prevent duplicates (#7594) @mike-wendt
  • Fix auto-detecting GPU architectures (#7593) @trxcllnt
  • Reduce cudf library size (#7583) @robertmaynard
  • Optimize cudf::make_strings_column for long strings (#7576) @davidwendt
  • Always build and export the cudf::cudftestutil target (#7574) @trxcllnt
  • Eliminate literal parameters to uvector::set_element_async and device_scalar::set_value (#7563) @harrism
  • Add gbenchmark for strings::concatenate (#7560) @davidwendt
  • Update Changelog Link (#7550) @ajschmidt8
  • Add gbenchmarks for strings replace regex functions (#7541) @davidwendt
  • Add __repr__ for Column and ColumnAccessor (#7531) @shwina
  • Support Decimal DIV changes in cudf (#7527) @razajafri
  • Remove unneeded step parameter from strings::detail::copy_slice (#7525) @davidwendt
  • Use device_uvector, device_span in sort groupby (#7523) @karthikeyann
  • Add gbenchmarks for strings extract function (#7522) @davidwendt
  • Rename ARROW_STATIC_LIB because it conflicts with one in FindArrow.cmake (#7518) @trxcllnt
  • Reduce compile time/size for scan.cu (#7516) @davidwendt
  • Change device_vector to device_uvector in nvtext source files (#7512) @davidwendt
  • Removed unneeded includes from traits.hpp (#7509) @davidwendt
  • FIX Remove random build directory generation for ccache (#7508) @dillon-cullinan
  • xfail failing pytest in pandas 1.2.3 (#7507) @galipremsagar
  • JNI bit cast (#7493) @revans2
  • Combine rolling window function tests (#7480) @mythrocks
  • Prepare Changelog for Automation (#7477) @ajschmidt8
  • Java support for explode position (#7471) @sperlingxx
  • Update 0.18 changelog entry (#7463) @ajschmidt8
  • JNI: Support skipping nulls for collect aggregation (#7457) @firestarman
  • Join APIs that return gathermaps (#7454) @shwina
  • Remove dependence on managed memory for multimap test (#7451) @jrhemstad
  • Use cuFile for Parquet IO when available (#7444) @vuule
  • Statistics cleanup (#7439) @kaatish
  • Add gbenchmarks for strings filter functions (#7438) @davidwendt
  • fixed_point + cudf::binary_operation API Changes (#7435) @codereport
  • Improve string gather performance (#7433) @jlowe
  • Don't use user resource for a temporary allocation in sort_by_key (#7431) @magnatelee
  • Detail APIs for datetime functions (#7430) @magnatelee
  • Replace thrust::max_element with thrust::reduce in strings findall_re (#7428) @davidwendt
  • Add gbenchmark for strings split/split_record functions (#7427) @davidwendt
  • Update JNI build to use CMAKE_CUDA_ARCHITECTURES (#7425) @jlowe
  • Change nvtext::load_vocabulary_file to return a unique ptr (#7424) @davidwendt
  • Simplify type dispatch with device_storage_dispatch (#7419) @codereport
  • Java support for casting of nested child columns (#7417) @razajafri
  • Improve scalar string replace performance for long strings (#7415) @jlowe
  • Remove unneeded temporary device vector for strings scatter specialization (#7409) @davidwendt
  • bitmask_or implementation with bitmask refactor (#7406) @rwlee
  • Add other cudf::strings::replace functions to current strings replace gbenchmark (#7403) @davidwendt
  • Clean up included headers in device_operators.cuh (#7401) @codereport
  • Move nullable index iterator to indexalator factory (#7399) @davidwendt
  • ENH Pass ccache variables to conda recipe & use Ninja in CI (#7398) @Ethyling
  • upgrade maven-antrun-plugin to support maven parallel builds (#7393) @rongou
  • Add gbenchmark for strings find/contains functions (#7392) @davidwendt
  • Use CMAKE_CUDA_ARCHITECTURES (#7391) @robertmaynard
  • Refactor libcudf strings::replace to use make_strings_children utility (#7384) @davidwendt
  • Added in JNI support for out of core sort algorithm (#7381) @revans2
  • Upgrade pandas to 1.2 (#7375) @galipremsagar
  • Rename logical_cast to bit_cast and allow additional conversions (#7373) @ttnghia
  • jitify 2 support (#7372) @cwharris
  • compile_udf: Cache PTX for similar functions (#7371) @gmarkall
  • Add string scalar replace benchmark (#7369) @jlowe
  • Add gbenchmark for strings contains_re/count_re functions (#7366) @davidwendt
  • Update orc reader and writer fuzz tests (#7357) @galipremsagar
  • Improve url_decode performance for long strings (#7353) @jlowe
  • cudf::ast Small Refactorings (#7352) @codereport
  • Remove std::cout and print in the scatter test function EmptyListsOfNullableStrings. (#7342) @ttnghia
  • Use cudf::detail::make_counting_transform_iterator (#7338) @codereport
  • Change block size parameter from a global to a template param. (#7333) @nvdbaranec
  • Partial clean up of ORC writer (#7324) @vuule
  • Add gbenchmark for cudf::strings::to_lower (#7316) @davidwendt
  • Update Java bindings version to 0.19-SNAPSHOT (#7307) @pxLi
  • Move cudf::test::make_counting_transform_iterator to cudf/detail/iterator.cuh (#7306) @codereport
  • Use string literals in fixed_point release_asserts (#7303) @codereport
  • Fix merge conflicts for #7295 (#7297) @ajschmidt8
  • Add UTF-8 chars to create_random_column<string_view> benchmark utility (#7292) @davidwendt
  • Abstracting block reduce and block scan from cuIO kernels with cub apis (#7278) @rgsl888prabhu
  • Build.sh use cmake --build to drive build system invocation (#7270) @robertmaynard
  • Refactor dictionary support for reductions any/all (#7242) @davidwendt
  • Replace stream.value() with stream for stream_view args (#7236) @karthikeyann
  • Interval index and interval_range (#7182) @marlenezw
  • avro reader integration tests (#7156) @cwharris
  • Rework libcudf CMakeLists.txt to export targets for CPM (#7107) @trxcllnt
  • Adding Interval Dtype (#6984) @marlenezw
  • Cleaning up for loops with make_(counting_)transform_iterator (#6546) @codereport

- C++
Published by GPUtester almost 5 years ago

https://github.com/rapidsai/cudf - v0.18.1

- C++
Published by GPUtester almost 5 years ago

https://github.com/rapidsai/cudf - [NIGHTLY] v0.18.0

🔗 Links

🚨 Breaking Changes

  • Default groupby to sort=False (#7180) @isVoid
  • Add libcudf API for parsing of ORC statistics (#7136) @vuule
  • Replace ORC writer api with class (#7099) @rgsl888prabhu
  • Pack/unpack functionality to convert tables to and from a serialized format. (#7096) @nvdbaranec
  • Replace parquet writer api with class (#7058) @rgsl888prabhu
  • Add days check to cudf::is_timestamp using cuda::std::chrono classes (#7028) @davidwendt
  • Fix default parameter values of write_csv and write_parquet (#6967) @vuule
  • Align Series.groupby API to match Pandas (#6964) @kkraus14
  • Share factorize implementation with Index and cudf module (#6885) @brandon-b-miller

🐛 Bug Fixes

  • Fix null-bounds calculation for ranged window queries (#7568) @mythrocks
  • Remove incorrect std::move call on return variable (#7319) @davidwendt
  • Fix failing CI ORC test (#7313) @vuule
  • Disallow constructing frames from a ColumnAccessor (#7298) @shwina
  • fix java cuFile tests (#7296) @rongou
  • Fix style issues related to NumPy (#7279) @shwina
  • Fix bug when iloc slice terminates at before-the-zero position (#7277) @isVoid
  • Fix copying dtype metadata after calling libcudf functions (#7271) @shwina
  • Move lists utility function definition out of header (#7266) @mythrocks
  • Throw if bool column would cause incorrect result when writing to ORC (#7261) @vuule
  • Use uvector in replace_nulls; Fix sort_helper::grouped_value doc (#7256) @isVoid
  • Remove floating point types from cudf::sort fast-path (#7250) @davidwendt
  • Disallow picking output columns from nested columns. (#7248) @devavret
  • Fix loc for Series with a MultiIndex (#7243) @shwina
  • Fix Arrow column test leaks (#7241) @tgravescs
  • Fix test column vector leak (#7238) @kuhushukla
  • Fix some bugs in java scalar support for decimal (#7237) @revans2
  • Improve assert_eq handling of scalar (#7220) @isVoid
  • Fix missing null_count() comparison in test framework and related failures (#7219) @nvdbaranec
  • Remove floating point types from radix sort fast-path (#7215) @davidwendt
  • Fixing parquet benchmarks (#7214) @rgsl888prabhu
  • Handle various parameter combinations in replace API (#7207) @galipremsagar
  • Export mock aws credentials for s3 tests (#7176) @ayushdg
  • Add MultiIndex.rename API (#7172) @isVoid
  • Fix importing list & struct types in from_arrow (#7162) @galipremsagar
  • Fixing parquet precision writing failing if scale is equal to precision (#7146) @hyperbolic2346
  • Update s3 tests to use moto_server (#7144) @ayushdg
  • Fix JIT cache multi-process test flakiness in slow drives (#7142) @devavret
  • Fix compilation errors in libcudf (#7138) @galipremsagar
  • Fix compilation failure caused by -Wall addition. (#7134) @codereport
  • Add informative error message for sep in CSV writer (#7095) @galipremsagar
  • Add JIT cache per compute capability (#7090) @devavret
  • Implement __hash__ method for ListDtype (#7081) @galipremsagar
  • Only upload packages that were built (#7077) @raydouglass
  • Fix comparisons between Series and cudf.NA (#7072) @brandon-b-miller
  • Handle nan values correctly in Series.one_hot_encoding (#7059) @galipremsagar
  • Add unstack() support for non-multiindexed dataframes (#7054) @isVoid
  • Fix read_orc for decimal type (#7034) @rgsl888prabhu
  • Fix backward compatibility of loading a 0.16 pkl file (#7033) @galipremsagar
  • Decimal casts in JNI became a NOOP (#7032) @revans2
  • Restore usual instance/subclass checking to cudf.DateOffset (#7029) @shwina
  • Add days check to cudf::is_timestamp using cuda::std::chrono classes (#7028) @davidwendt
  • Fix to_csv delimiter handling of timestamp format (#7023) @davidwendt
  • Pin librdkafka to gcc 7 compatible version (#7021) @raydouglass
  • Fix fillna & dropna to also consider np.nan as a missing value (#7019) @galipremsagar
  • Fix round operator's HALF_EVEN computation for negative integers (#7014) @nartal1
  • Skip Thrust sort patch if already applied (#7009) @harrism
  • Fix cudf::hash_partition for decimal32 and decimal64 (#7006) @codereport
  • Fix Thrust unroll patch command (#7002) @harrism
  • Fix loc behaviour when key of incorrect type is used (#6993) @shwina
  • Fix int to datetime conversion in csv_read (#6991) @kaatish
  • fix excluding cufile tests by default (#6988) @rongou
  • Fix java cufile tests when cufile is not installed (#6987) @revans2
  • Make cudf::round for fixed_point when scale = -decimal_places a no-op (#6975) @codereport
  • Fix type comparison for java (#6970) @revans2
  • Fix default parameter values of write_csv and write_parquet (#6967) @vuule
  • Align Series.groupby API to match Pandas (#6964) @kkraus14
  • Fix timestamp parsing in ORC reader for timezones without transitions (#6959) @vuule
  • Fix typo in numerical.py (#6957) @rgsl888prabhu
  • fixed_point_value double-shifts in fixed_point construction (#6950) @codereport
  • fix libcu++ include path for jni (#6948) @rongou
  • Fix groupby agg/apply behaviour when no key columns are provided (#6945) @shwina
  • Avoid inserting null elements into join hash table when nulls are treated as unequal (#6943) @hyperbolic2346
  • Fix cudf::merge gtest for dictionary columns (#6942) @davidwendt
  • Pass numeric scalars of the same dtype through numeric binops (#6938) @brandon-b-miller
  • Fix N/A detection for empty fields in CSV reader (#6922) @vuule
  • Fix rmm_mode=managed parameter for gtests (#6912) @davidwendt
  • Fix nullmask offset handling in parquet and orc writer (#6889) @kaatish
  • Correct the sampling range when sampling with replacement (#6884) @ChrisJar
  • Handle nested string columns with no children in contiguous_split. (#6864) @nvdbaranec
  • Fix columns & index handling in dataframe constructor (#6838) @galipremsagar

📖 Documentation

  • Update readme (#7318) @shwina
  • Fix typo in cudf.core.column.string.extract docs (#7253) @adelevie
  • Update doxyfile project number (#7161) @davidwendt
  • Update 10 minutes to cuDF and CuPy with new APIs (#7158) @ChrisJar
  • Cross link RMM & libcudf Doxygen docs (#7149) @ajschmidt8
  • Add documentation for support dtypes in all IO formats (#7139) @galipremsagar
  • Add groupby docs (#7100) @shwina
  • Update cudf python docstrings with new null representation (<NA>) (#7050) @galipremsagar
  • Make Doxygen comments formatting consistent (#7041) @vuule
  • Add docs for working with missing data (#7010) @galipremsagar
  • Remove warning in from_dlpack and to_dlpack methods (#7001) @miguelusque
  • libcudf Developer Guide (#6977) @harrism
  • Add JNI wrapper for the cuFile API (GDS) (#6940) @rongou

🚀 New Features

  • Support numeric_only field for rank() (#7213) @isVoid
  • Add support for cudf::binary_operation TRUE_DIV for decimal32 and decimal64 (#7198) @codereport
  • Implement COLLECT rolling window aggregation (#7189) @mythrocks
  • Add support for array-like inputs in cudf.get_dummies (#7181) @galipremsagar
  • Default groupby to sort=False (#7180) @isVoid
  • Add libcudf lists column count_elements API (#7173) @davidwendt
  • Implement cudf::group_by (sort) for decimal32 and decimal64 (#7169) @codereport
  • Add encoding and compression argument to CSV writer (#7168) @VibhuJawa
  • cudf::rolling_window SUM support for decimal32 and decimal64 (#7147) @codereport
  • Adding support for explode to cuDF (#7140) @hyperbolic2346
  • Add libcudf API for parsing of ORC statistics (#7136) @vuule
  • update GDS/cuFile location for 0.9 release (#7131) @rongou
  • Add Segmented sort (#7122) @karthikeyann
  • Add cudf::binary_operation NULL_MIN, NULL_MAX & NULL_EQUALS for decimal32 and decimal64 (#7119) @codereport
  • Add scale and value methods to fixed_point (#7109) @codereport
  • Replace ORC writer api with class (#7099) @rgsl888prabhu
  • Pack/unpack functionality to convert tables to and from a serialized format. (#7096) @nvdbaranec
  • Improve digitize API (#7071) @isVoid
  • Add List types support in data generator (#7064) @galipremsagar
  • cudf::scan support for decimal32 and decimal64 (#7063) @codereport
  • cudf::rolling ROW_NUMBER support for decimal32 and decimal64 (#7061) @codereport
  • Replace parquet writer api with class (#7058) @rgsl888prabhu
  • Support contains() on lists of primitives (#7039) @mythrocks
  • Implement cudf::rolling for decimal32 and decimal64 (#7037) @codereport
  • Add ffill and bfill to string columns (#7036) @isVoid
  • Enable round in cudf for DataFrame and Series (#7022) @ChrisJar
  • Extend replace_nulls_policy to string and dictionary type (#7004) @isVoid
  • Add segmented_gather(list_column, gather_list) (#7003) @karthikeyann
  • Add method field to fillna for fixed width columns (#6998) @isVoid
  • Manual merge of branch 0.17 into branch 0.18 (#6995) @shwina
  • Implement cudf::reduce for decimal32 and decimal64 (part 2) (#6980) @codereport
  • Add Ufunc alias look up for appropriate numpy ufunc dispatching (#6973) @VibhuJawa
  • Add pytest-xdist to dev environment.yml (#6958) @galipremsagar
  • Add Index.set_names api (#6929) @galipremsagar
  • Add replace_null API with replace_policy parameter, fixed_width column support (#6907) @isVoid
  • Share factorize implementation with Index and cudf module (#6885) @brandon-b-miller
  • Implement update() function (#6883) @skirui-source
  • Add groupby idxmin, idxmax aggregation (#6856) @karthikeyann
  • Implement cudf::reduce for decimal32 and decimal64 (part 1) (#6814) @codereport
  • Implement cudf.DateOffset for months (#6775) @brandon-b-miller
  • Add Python DecimalColumn (#6715) @shwina
  • Add dictionary support to libcudf groupby functions (#6585) @davidwendt

🛠️ Improvements

  • Update stale GHA with exemptions & new labels (#7395) @mike-wendt
  • Add GHA to mark issues/prs as stale/rotten (#7388) @Ethyling
  • Unpin from numpy < 1.20 (#7335) @shwina
  • Prepare Changelog for Automation (#7309) @galipremsagar
  • Prepare Changelog for Automation (#7272) @ajschmidt8
  • Add JNI support for converting Arrow buffers to CUDF ColumnVectors (#7222) @tgravescs
  • Add coverage for skiprows and num_rows in parquet reader fuzz testing (#7216) @galipremsagar
  • Define and implement more behavior for merging on categorical variables (#7209) @brandon-b-miller
  • Add CudfSeriesGroupBy to optimize dask_cudf groupby-mean (#7194) @rjzamora
  • Add dictionary column support to rolling_window (#7186) @davidwendt
  • Modify the semantics of end pointers in cuIO to match standard library (#7179) @vuule
  • Adding unit tests for fixed_point with extremely large scales (#7178) @codereport
  • Fast path single column sort (#7167) @davidwendt
  • Fix -Werror=sign-compare errors in device code (#7164) @trxcllnt
  • Refactor cudf::string_view host and device code (#7159) @davidwendt
  • Enable logic for GPU auto-detection in cudfjni (#7155) @gerashegalov
  • Java bindings for Fixed-point type support for Parquet (#7153) @razajafri
  • Add Java interface for the new API 'explode' (#7151) @firestarman
  • Replace offsets with iterators in cuIO utilities and CSV parser (#7150) @vuule
  • Add gbenchmarks for reduction aggregations any() and all() (#7129) @davidwendt
  • Update JNI for contiguous_split packed results (#7127) @jlowe
  • Add JNI and Java bindings for list_contains (#7125) @kuhushukla
  • Add Java unit tests for window aggregate 'collect' (#7121) @firestarman
  • verify window operations on decimal with java tests (#7120) @sperlingxx
  • Adds in JNI support for creating a list column from existing columns (#7112) @revans2
  • Build libcudf with -Wall (#7105) @trxcllnt
  • Add column_device_view pointers to EncColumnDesc (#7097) @kaatish
  • Add pyorc to dev environment (#7085) @galipremsagar
  • JNI support for creating struct column from existing columns and fixed bug in struct with no children (#7084) @revans2
  • Fastpath single strings column in cudf::sort (#7075) @davidwendt
  • Upgrade nvcomp to 1.2.1 (#7069) @rongou
  • Refactor ORC ProtobufReader to make it more extendable (#7055) @vuule
  • Add Java tests for decimal casts (#7051) @sperlingxx
  • Auto-label PRs based on their content (#7044) @jolorunyomi
  • Create sort gbenchmark for strings column (#7040) @davidwendt
  • Refactor io memory fetches to use hostdevice_vector methods (#7035) @ChrisJar
  • Spark Murmur3 hash functionality (#7024) @rwlee
  • Fix libcudf strings logic where size_type is used to access INT32 column data (#7020) @davidwendt
  • Adding decimal writing support to parquet (#7017) @hyperbolic2346
  • Add compression="infer" as default for dask_cudf.read_csv (#7013) @rjzamora
  • Correct ORC docstring; other minor cuIO improvements (#7012) @vuule
  • Reduce number of hostdevice_vector allocations in parquet reader (#7005) @devavret
  • Check output size overflow on strings gather (#6997) @davidwendt
  • Improve representation of MultiIndex (#6992) @galipremsagar
  • Disable some pragma unroll statements in thrust sort.h (#6982) @davidwendt
  • Minor cudf::round internal refactoring (#6976) @codereport
  • Add Java bindings for URL conversion (#6972) @jlowe
  • Enable strict_decimal_types in parquet reading (#6969) @sperlingxx
  • Add in basic support to JNI for logical_cast (#6954) @revans2
  • Remove duplicate file array_tests.cpp (#6953) @karthikeyann
  • Add null mask fixed_point_column_wrapper constructors (#6951) @codereport
  • Update Java bindings version to 0.18-SNAPSHOT (#6949) @jlowe
  • Use simplified rmm::exec_policy (#6939) @harrism
  • Add null count test for apply_boolean_mask (#6903) @harrism
  • Implement DataFrame.quantile for datetime and timedelta data types (#6902) @ChrisJar
  • Remove **kwargs from string/categorical methods (#6750) @shwina
  • Refactor rolling.cu to reduce compile time (#6512) @mythrocks
  • Add static type checking via Mypy (#6381) @shwina
  • Update to official libcu++ on Github (#6275) @trxcllnt

- C++
Published by rapids-bot[bot] almost 5 years ago

https://github.com/rapidsai/cudf - v0.18.0

Breaking Changes 🚨

  • Default groupby to sort=False (#7180) @isVoid
  • Add libcudf API for parsing of ORC statistics (#7136) @vuule
  • Replace ORC writer api with class (#7099) @rgsl888prabhu
  • Pack/unpack functionality to convert tables to and from a serialized format. (#7096) @nvdbaranec
  • Replace parquet writer api with class (#7058) @rgsl888prabhu
  • Add days check to cudf::is_timestamp using cuda::std::chrono classes (#7028) @davidwendt
  • Fix default parameter values of write_csv and write_parquet (#6967) @vuule
  • Align Series.groupby API to match Pandas (#6964) @kkraus14
  • Share factorize implementation with Index and cudf module (#6885) @brandon-b-miller

Bug Fixes 🐛

  • Remove incorrect std::move call on return variable (#7319) @davidwendt
  • Fix failing CI ORC test (#7313) @vuule
  • Disallow constructing frames from a ColumnAccessor (#7298) @shwina
  • fix java cuFile tests (#7296) @rongou
  • Fix style issues related to NumPy (#7279) @shwina
  • Fix bug when iloc slice terminates at before-the-zero position (#7277) @isVoid
  • Fix copying dtype metadata after calling libcudf functions (#7271) @shwina
  • Move lists utility function definition out of header (#7266) @mythrocks
  • Throw if bool column would cause incorrect result when writing to ORC (#7261) @vuule
  • Use uvector in replace_nulls; Fix sort_helper::grouped_value doc (#7256) @isVoid
  • Remove floating point types from cudf::sort fast-path (#7250) @davidwendt
  • Disallow picking output columns from nested columns. (#7248) @devavret
  • Fix loc for Series with a MultiIndex (#7243) @shwina
  • Fix Arrow column test leaks (#7241) @tgravescs
  • Fix test column vector leak (#7238) @kuhushukla
  • Fix some bugs in java scalar support for decimal (#7237) @revans2
  • Improve assert_eq handling of scalar (#7220) @isVoid
  • Fix missing null_count() comparison in test framework and related failures (#7219) @nvdbaranec
  • Remove floating point types from radix sort fast-path (#7215) @davidwendt
  • Fixing parquet benchmarks (#7214) @rgsl888prabhu
  • Handle various parameter combinations in replace API (#7207) @galipremsagar
  • Export mock aws credentials for s3 tests (#7176) @ayushdg
  • Add MultiIndex.rename API (#7172) @isVoid
  • Fix importing list & struct types in from_arrow (#7162) @galipremsagar
  • Fixing parquet precision writing failing if scale is equal to precision (#7146) @hyperbolic2346
  • Update s3 tests to use moto_server (#7144) @ayushdg
  • Fix JIT cache multi-process test flakiness in slow drives (#7142) @devavret
  • Fix compilation errors in libcudf (#7138) @galipremsagar
  • Fix compilation failure caused by -Wall addition. (#7134) @codereport
  • Add informative error message for sep in CSV writer (#7095) @galipremsagar
  • Add JIT cache per compute capability (#7090) @devavret
  • Implement __hash__ method for ListDtype (#7081) @galipremsagar
  • Only upload packages that were built (#7077) @raydouglass
  • Fix comparisons between Series and cudf.NA (#7072) @brandon-b-miller
  • Handle nan values correctly in Series.one_hot_encoding (#7059) @galipremsagar
  • Add unstack() support for non-multiindexed dataframes (#7054) @isVoid
  • Fix read_orc for decimal type (#7034) @rgsl888prabhu
  • Fix backward compatibility of loading a 0.16 pkl file (#7033) @galipremsagar
  • Decimal casts in JNI became a NOOP (#7032) @revans2
  • Restore usual instance/subclass checking to cudf.DateOffset (#7029) @shwina
  • Add days check to cudf::is_timestamp using cuda::std::chrono classes (#7028) @davidwendt
  • Fix to_csv delimiter handling of timestamp format (#7023) @davidwendt
  • Pin librdkafka to gcc 7 compatible version (#7021) @raydouglass
  • Fix fillna & dropna to also consider np.nan as a missing value (#7019) @galipremsagar
  • Fix round operator's HALF_EVEN computation for negative integers (#7014) @nartal1
  • Skip Thrust sort patch if already applied (#7009) @harrism
  • Fix cudf::hash_partition for decimal32 and decimal64 (#7006) @codereport
  • Fix Thrust unroll patch command (#7002) @harrism
  • Fix loc behaviour when key of incorrect type is used (#6993) @shwina
  • Fix int to datetime conversion in csv_read (#6991) @kaatish
  • fix excluding cufile tests by default (#6988) @rongou
  • Fix java cufile tests when cufile is not installed (#6987) @revans2
  • Make cudf::round for fixed_point when scale = -decimal_places a no-op (#6975) @codereport
  • Fix type comparison for java (#6970) @revans2
  • Fix default parameter values of write_csv and write_parquet (#6967) @vuule
  • Align Series.groupby API to match Pandas (#6964) @kkraus14
  • Fix timestamp parsing in ORC reader for timezones without transitions (#6959) @vuule
  • Fix typo in numerical.py (#6957) @rgsl888prabhu
  • fixed_point_value double-shifts in fixed_point construction (#6950) @codereport
  • fix libcu++ include path for jni (#6948) @rongou
  • Fix groupby agg/apply behaviour when no key columns are provided (#6945) @shwina
  • Avoid inserting null elements into join hash table when nulls are treated as unequal (#6943) @hyperbolic2346
  • Fix cudf::merge gtest for dictionary columns (#6942) @davidwendt
  • Pass numeric scalars of the same dtype through numeric binops (#6938) @brandon-b-miller
  • Fix N/A detection for empty fields in CSV reader (#6922) @vuule
  • Fix rmm_mode=managed parameter for gtests (#6912) @davidwendt
  • Fix nullmask offset handling in parquet and orc writer (#6889) @kaatish
  • Correct the sampling range when sampling with replacement (#6884) @ChrisJar
  • Handle nested string columns with no children in contiguous_split. (#6864) @nvdbaranec
  • Fix columns & index handling in dataframe constructor (#6838) @galipremsagar
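
For readers checking the HALF_EVEN rounding fix listed above (#7014), the following is a minimal reference sketch of half-even ("banker's") rounding for integers with negative decimal places. It uses Python's standard decimal module rather than cudf itself, and the helper name round_half_even is hypothetical; it only illustrates the expected semantics, not the library's implementation.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def round_half_even(value: int, decimal_places: int) -> int:
    """Round an integer using half-even rounding.

    A negative `decimal_places` rounds to tens, hundreds, and so on.
    Hypothetical helper for illustration; not part of cudf.
    """
    # decimal_places = -1 gives a quantum of 1E+1, i.e. round to the nearest ten.
    quantum = Decimal(1).scaleb(-decimal_places)
    return int(Decimal(value).quantize(quantum, rounding=ROUND_HALF_EVEN))


# Ties go to the even neighbour, for negative inputs as well as positive ones.
print(round_half_even(-25, -1))
print(round_half_even(-35, -1))
```

Under these semantics a tie such as -25 rounded to tens lands on -20 (the even multiple), while -35 lands on -40.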

Documentation 📖

  • Update readme (#7318) @shwina
  • Fix typo in cudf.core.column.string.extract docs (#7253) @adelevie
  • Update doxyfile project number (#7161) @davidwendt
  • Update 10 minutes to cuDF and CuPy with new APIs (#7158) @ChrisJar
  • Cross link RMM & libcudf Doxygen docs (#7149) @ajschmidt8
  • Add documentation for support dtypes in all IO formats (#7139) @galipremsagar
  • Add groupby docs (#7100) @shwina
  • Update cudf python docstrings with new null representation (<NA>) (#7050) @galipremsagar
  • Make Doxygen comments formatting consistent (#7041) @vuule
  • Add docs for working with missing data (#7010) @galipremsagar
  • Remove warning in from_dlpack and to_dlpack methods (#7001) @miguelusque
  • libcudf Developer Guide (#6977) @harrism
  • Add JNI wrapper for the cuFile API (GDS) (#6940) @rongou

New Features 🚀

  • Support numeric_only field for rank() (#7213) @isVoid
  • Add support for cudf::binary_operation TRUE_DIV for decimal32 and decimal64 (#7198) @codereport
  • Implement COLLECT rolling window aggregation (#7189) @mythrocks
  • Add support for array-like inputs in cudf.get_dummies (#7181) @galipremsagar
  • Default groupby to sort=False (#7180) @isVoid
  • Add libcudf lists column count_elements API (#7173) @davidwendt
  • Implement cudf::group_by (sort) for decimal32 and decimal64 (#7169) @codereport
  • Add encoding and compression argument to CSV writer (#7168) @VibhuJawa
  • cudf::rolling_window SUM support for decimal32 and decimal64 (#7147) @codereport
  • Adding support for explode to cuDF (#7140) @hyperbolic2346
  • Add libcudf API for parsing of ORC statistics (#7136) @vuule
  • update GDS/cuFile location for 0.9 release (#7131) @rongou
  • Add Segmented sort (#7122) @karthikeyann
  • Add cudf::binary_operation NULL_MIN, NULL_MAX & NULL_EQUALS for decimal32 and decimal64 (#7119) @codereport
  • Add scale and value methods to fixed_point (#7109) @codereport
  • Replace ORC writer api with class (#7099) @rgsl888prabhu
  • Pack/unpack functionality to convert tables to and from a serialized format. (#7096) @nvdbaranec
  • Improve digitize API (#7071) @isVoid
  • Add List types support in data generator (#7064) @galipremsagar
  • cudf::scan support for decimal32 and decimal64 (#7063) @codereport
  • cudf::rolling ROW_NUMBER support for decimal32 and decimal64 (#7061) @codereport
  • Replace parquet writer api with class (#7058) @rgsl888prabhu
  • Support contains() on lists of primitives (#7039) @mythrocks
  • Implement cudf::rolling for decimal32 and decimal64 (#7037) @codereport
  • Add ffill and bfill to string columns (#7036) @isVoid
  • Enable round in cudf for DataFrame and Series (#7022) @ChrisJar
  • Extend replace_nulls_policy to string and dictionary type (#7004) @isVoid
  • Add segmented_gather(list_column, gather_list) (#7003) @karthikeyann
  • Add method field to fillna for fixed width columns (#6998) @isVoid
  • Manual merge of branch 0.17 into branch 0.18 (#6995) @shwina
  • Implement cudf::reduce for decimal32 and decimal64 (part 2) (#6980) @codereport
  • Add Ufunc alias look up for appropriate numpy ufunc dispatching (#6973) @VibhuJawa
  • Add pytest-xdist to dev environment.yml (#6958) @galipremsagar
  • Add Index.set_names api (#6929) @galipremsagar
  • Add replace_null API with replace_policy parameter, fixed_width column support (#6907) @isVoid
  • Share factorize implementation with Index and cudf module (#6885) @brandon-b-miller
  • Implement update() function (#6883) @skirui-source
  • Add groupby idxmin, idxmax aggregation (#6856) @karthikeyann
  • Implement cudf::reduce for decimal32 and decimal64 (part 1) (#6814) @codereport
  • Implement cudf.DateOffset for months (#6775) @brandon-b-miller
  • Add Python DecimalColumn (#6715) @shwina
  • Add dictionary support to libcudf groupby functions (#6585) @davidwendt
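
Many of the entries above extend decimal32 and decimal64 (fixed_point) support. As a rough mental model only, assuming a fixed-point number is stored as an integer value plus a power-of-ten scale, the toy Python class below shows why the shift has to be applied exactly once at construction. FixedPoint and its methods are hypothetical names for illustration, not the libcudf API.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class FixedPoint:
    """Toy model: the represented number is value * 10**scale."""

    value: int  # stored integer, already shifted
    scale: int  # power-of-ten exponent; negative means fractional digits

    @classmethod
    def from_number(cls, number: int, scale: int) -> "FixedPoint":
        # Shift once so that stored * 10**scale reproduces the number.
        if scale <= 0:
            stored = number * 10 ** (-scale)
        else:
            # A positive scale drops low digits; truncate toward zero.
            sign = -1 if number < 0 else 1
            stored = sign * (abs(number) // 10**scale)
        return cls(stored, scale)

    def to_decimal(self) -> Decimal:
        # Recover the represented number from the stored integer and scale.
        return Decimal(self.value).scaleb(self.scale)


five = FixedPoint.from_number(5, -2)  # stored value 500, scale -2
same = FixedPoint(500, -2)            # already-shifted value, no second shift
```

Feeding the already-shifted 500 back through from_number would store 50000, which is the kind of double shift a constructor has to avoid.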

Improvements 🛠️

  • Update stale GHA with exemptions & new labels (#7395) @mike-wendt
  • Add GHA to mark issues/prs as stale/rotten (#7388) @Ethyling
  • Unpin from numpy < 1.20 (#7335) @shwina
  • Prepare Changelog for Automation (#7309) @galipremsagar
  • Prepare Changelog for Automation (#7272) @ajschmidt8
  • Add JNI support for converting Arrow buffers to CUDF ColumnVectors (#7222) @tgravescs
  • Add coverage for skiprows and num_rows in parquet reader fuzz testing (#7216) @galipremsagar
  • Define and implement more behavior for merging on categorical variables (#7209) @brandon-b-miller
  • Add CudfSeriesGroupBy to optimize dask_cudf groupby-mean (#7194) @rjzamora
  • Add dictionary column support to rolling_window (#7186) @davidwendt
  • Modify the semantics of end pointers in cuIO to match standard library (#7179) @vuule
  • Adding unit tests for fixed_point with extremely large scales (#7178) @codereport
  • Fast path single column sort (#7167) @davidwendt
  • Fix -Werror=sign-compare errors in device code (#7164) @trxcllnt
  • Refactor cudf::string_view host and device code (#7159) @davidwendt
  • Enable logic for GPU auto-detection in cudfjni (#7155) @gerashegalov
  • Java bindings for Fixed-point type support for Parquet (#7153) @razajafri
  • Add Java interface for the new API 'explode' (#7151) @firestarman
  • Replace offsets with iterators in cuIO utilities and CSV parser (#7150) @vuule
  • Add gbenchmarks for reduction aggregations any() and all() (#7129) @davidwendt
  • Update JNI for contiguous_split packed results (#7127) @jlowe
  • Add JNI and Java bindings for list_contains (#7125) @kuhushukla
  • Add Java unit tests for window aggregate 'collect' (#7121) @firestarman
  • verify window operations on decimal with java tests (#7120) @sperlingxx
  • Adds in JNI support for creating a list column from existing columns (#7112) @revans2
  • Build libcudf with -Wall (#7105) @trxcllnt
  • Add column_device_view pointers to EncColumnDesc (#7097) @kaatish
  • Add pyorc to dev environment (#7085) @galipremsagar
  • JNI support for creating struct column from existing columns and fixed bug in struct with no children (#7084) @revans2
  • Fastpath single strings column in cudf::sort (#7075) @davidwendt
  • Upgrade nvcomp to 1.2.1 (#7069) @rongou
  • Refactor ORC ProtobufReader to make it more extendable (#7055) @vuule
  • Add Java tests for decimal casts (#7051) @sperlingxx
  • Auto-label PRs based on their content (#7044) @jolorunyomi
  • Create sort gbenchmark for strings column (#7040) @davidwendt
  • Refactor io memory fetches to use hostdevice_vector methods (#7035) @ChrisJar
  • Spark Murmur3 hash functionality (#7024) @rwlee
  • Fix libcudf strings logic where size_type is used to access INT32 column data (#7020) @davidwendt
  • Adding decimal writing support to parquet (#7017) @hyperbolic2346
  • Add compression="infer" as default for dask_cudf.read_csv (#7013) @rjzamora
  • Correct ORC docstring; other minor cuIO improvements (#7012) @vuule
  • Reduce number of hostdevice_vector allocations in parquet reader (#7005) @devavret
  • Check output size overflow on strings gather (#6997) @davidwendt
  • Improve representation of MultiIndex (#6992) @galipremsagar
  • Disable some pragma unroll statements in thrust sort.h (#6982) @davidwendt
  • Minor cudf::round internal refactoring (#6976) @codereport
  • Add Java bindings for URL conversion (#6972) @jlowe
  • Enable strict_decimal_types in parquet reading (#6969) @sperlingxx
  • Add in basic support to JNI for logical_cast (#6954) @revans2
  • Remove duplicate file array_tests.cpp (#6953) @karthikeyann
  • Add null mask fixed_point_column_wrapper constructors (#6951) @codereport
  • Update Java bindings version to 0.18-SNAPSHOT (#6949) @jlowe
  • Use simplified rmm::exec_policy (#6939) @harrism
  • Add null count test for apply_boolean_mask (#6903) @harrism
  • Implement DataFrame.quantile for datetime and timedelta data types (#6902) @ChrisJar
  • Remove **kwargs from string/categorical methods (#6750) @shwina
  • Refactor rolling.cu to reduce compile time (#6512) @mythrocks
  • Add static type checking via Mypy (#6381) @shwina
  • Update to official libcu++ on Github (#6275) @trxcllnt

- C++
Published by GPUtester about 5 years ago

https://github.com/rapidsai/cudf - v0.17.0

v0.17.0 Release

- C++
Published by GPUtester about 5 years ago

https://github.com/rapidsai/cudf - v0.16.0

v0.16.0 Release

- C++
Published by GPUtester over 5 years ago

https://github.com/rapidsai/cudf - v0.15.0

v0.15.0 Release

- C++
Published by raydouglass over 5 years ago