cudf

https://github.com/rapidsai/cudf - v25.08.00

🚨 Breaking Changes

Allow np.dtype('object') for cases that are valid (#19478) @galipremsagar
[FEA] Remove CUDA JIT-Compatibility Checks & CCCL WARs (#19470) @lamarrr
Drop cuda 11 usages (#19386) @galipremsagar
Deprecate cudf::round for float types (#19298) @davidwendt
Support output_dtype in cudf::reduce for nunique aggregation (#19265) @davidwendt
Change default cudf-polars executor to "streaming" (#19263) @TomAugspurger
Fix Handling of Complex Types in AST (#19248) @lamarrr
Enable chunked reading of PQ sources with >2B rows (#19245) @mhaseeb123
Refactor grid_1d class (#19211) @lamarrr
Return valid for all-nulls in reduce() with nunique include-nulls aggregation (#19196) @davidwendt
Refactor JNI error handling (#19149) @ttnghia
Remove CUDA 11 from dependencies.yaml (#19139) @KyleFromNVIDIA
Quick fixes of modernize-use-constraints rule (#19105) @vuule
Filter Parquet row groups using row bounds (#19082) @mhaseeb123
Temporarily revert "Refactor JNI error handling (#18983)" (#19076) @abellina
Rename parquet_chunked_writer to chunked_parquet_writer for consistency with the reader (#19047) @mhaseeb123
Compile libcudf using C++20 Standard (#19045) @vuule
Refactor JNI error handling (#18983) @ttnghia
stop uploading packages to downloads.rapids.ai (#18973) @jameslamb
Remove deprecated Series methods, isclose (#18947) @mroeschke
Remove deprecated groupby.collect (#18946) @mroeschke
Remove deprecated get_dummies(cats=, ...) (#18944) @mroeschke
Add pylibcudf.Column.from_arrow factory method (#18937) @Matt711
Add pylibcudf.Table.from_arrow factory method (#18936) @Matt711
Remove deprecated APIs (#18933) @vuule
Remove cudf.Scalar (#18927) @mroeschke
Remove deprecated cudf::io::host_buffer (#18881) @Matt711
Null-handling for Transforms (#18845) @lamarrr
Enable skip_rows in the chunked parquet reader. (#18130) @mhaseeb123

🐛 Bug Fixes

Increase alignment requirement for parquet bloom filter to 256 (#19595) @mhaseeb123
Revert "Add primitive row dispatch support for semi/anti join and cudf::contains" (#19503) @PointKernel
Allow np.dtype('object') for cases that are valid (#19478) @galipremsagar
Add conda dependency on nvidia-ml-py. (#19454) @bdice
Mark cudf.pandas notebook repr test as flaky (#19441) @Matt711
Fix pytest to properly expose a bug (#19433) @galipremsagar
Switch from thrust::sort to cub::DeviceRadixSort in Parquet chunked reader (#19414) @ttnghia
Use numba-cuda>=0.15.2,<0.16 (#19413) @bdice
Update String Transform Examples (#19407) @lamarrr
[BUG] Make floor division and modulo by 0 match CPU polars (#19406) @Matt711
Handle empty input in cudf::strings::extract APIs (#19398) @davidwendt
Fix jitify error on exit from FILTER_TEST (#19395) @davidwendt
Update cudf.pandas tests to silence deprecation warnings (#19377) @Matt711
Replace sprintf with snprintf in libcudf parquet tests (#19371) @davidwendt
Make DateOffset respect timezone (#19366) @Matt711
Fix flaky tests in cudf.pandas (#19345) @TomAugspurger
Update protocol choices for ucxx in PDSH benchmark (#19343) @TomAugspurger
Remove passing pandas tests from xfail list (#19341) @Matt711
Fix Union-Slice bug (#19336) @Matt711
Fix bit shift overflow in segmented_offset_bitmask_binop utility (#19329) @davidwendt
Fix job filters for pandas-tests (#19322) @galipremsagar
Fix compile warning in interop_stringview.cpp (#19320) @davidwendt
Fix a use-after-free issue in TDigest aggregation code. (#19311) @nvdbaranec
Always represent datetime aware data as UTC in strftime (#19304) @mroeschke
Do not pass cupy objects to numba kernels directly (#19283) @brandon-b-miller
Correct docstring for DataFrame.apply to match code (#19262) @dagardner-nv
Cast n_unique aggregation result to match polars (#19256) @Matt711
Fix Handling of Complex Types in AST (#19248) @lamarrr
Add missing include (#19239) @vyasr
Raise MixedTypeError for conditions that lead to mixed types (#19232) @galipremsagar
Fix errors in the nvCOMP adapter (#19221) @vuule
Remove nvToolsExt usage (#19209) @vyasr
Fix a pair of bugs in getdecompressionscratch() size. (#19207) @nvdbaranec
Allow is_list_like to return correct values by disabling it (#19188) @galipremsagar
Fix slicing after Join and GroupBy in streaming cudf-polars (#19187) @rjzamora
Fix binops type preservation for some dtypes (#19183) @galipremsagar
Fix streaming GroupBy on non-trivial keys (#19181) @rjzamora
Fix bitmask in from_arrow_host for sliced stringview type (#19174) @davidwendt
Fixed group_by mean with missing values and multiple partitions (#19165) @TomAugspurger
Add fallback to HStack lowering in cudf-polars (#19163) @rjzamora
Fix Literal partitioning in cudf-polars (#19160) @rjzamora
Fix from_array_interface for empty arrays (#19144) @Matt711
Adding GH_TOKEN pass-through to summarize job (#19143) @msarahan
Fix hash collision in Union([MapFunction]) (#19124) @TomAugspurger
Fix bug in group_by().n_unique() in streaming cudf-polars (#19108) @rjzamora
Parse (non-MultiIndex) label-based keys to structured data (#19103) @mroeschke
Fix cudf_polars spilling (#19101) @TomAugspurger
Fix libcudf strings case logic to set null-row size to zero (#19095) @davidwendt
Temporarily revert "Refactor JNI error handling (#18983)" (#19076) @abellina
Temporary workaround for incorrect SplitScan results in cuDF-Polars (#19071) @rjzamora
Use default memory resource for JSON_QUOTE_NORMALIZATION gtests (#19057) @davidwendt
Added null-probability to polynomial benchmarks and fixed transform call-sites (#18972) @lamarrr
Fix flaky custreamz test (#18961) @TomAugspurger
Fix tdigest percentile correctness for low row-counts (#18952) @mythrocks
Enable skip_rows in the chunked parquet reader. (#18130) @mhaseeb123

📖 Documentation

Update conda environment file for CUDA 12.9 compatibility (#19376) @a-hirota
Update recommended gcc version in contributing guide (#19365) @davidwendt
Autodoc DateOffset (#19297) @wence-
Fix cudf::column_device_view::element() doxygen (#19296) @davidwendt
Document aggregations for cudf::reduce in doxygen (#19264) @davidwendt
add docs on CI workflow inputs (#19234) @jameslamb
Update README and CONTRIBUTING to reflect new CUDA requirements (#19138) @PointKernel
Remove the extra index URL for CUDA 12 (#19128) @vyasr
Improve WordPieceVocabulary.tokenize documentation (#19098) @davidwendt
Add some basic streaming engine documentation (#19088) @wence-
Update the contributing guide to include pylibcudf in the build command (#19011) @Matt711
Fix pylibcudf docs for some strings APIs (#19004) @davidwendt
Update cuDF Python library design with BaseIndex and pylibcudf updates (#18903) @mroeschke

🚀 New Features

Avoid using UVM on systems without a traditional memory resource (#19444) @Matt711
Add parquet-sampling configuration options (#19423) @rjzamora
Add new JSON reader interface accepting string column input to pylibcudf (#19400) @shrshi
Add a parquet reader utility to update output null masks (#19370) @mhaseeb123
Build and ship shim.cu file as LTOIR (#19368) @brandon-b-miller
Add cudf::strings::find_instance API (#19326) @davidwendt
Add single-file streaming Sink support (#19317) @rjzamora
Support null_count expression (#19314) @Matt711
Materialize tables in the experimental Parquet reader (#19308) @mhaseeb123
Add new cudf::top_k API (#19303) @davidwendt
Add cudf::strings::split_part API (#19289) @davidwendt
Support output_dtype in cudf::reduce for nunique aggregation (#19265) @davidwendt
Add post_traversal API to cudf-polars (#19258) @rjzamora
Deprecate DataFrame.apply_rows (#19218) @brandon-b-miller
Require numba-cuda>=0.16.0 (#19213) @brandon-b-miller
Add a mode to co-process decompression and compression on host and device (#19203) @vuule
Return valid for all-nulls in reduce() with nunique include-nulls aggregation (#19196) @davidwendt
Refactor JNI error handling (#19149) @ttnghia
Add support for horizontal string concatenation pl.concat_str (#19142) @Matt711
Add PDS-DS Query 1 (#19131) @Matt711
Support cudf-polars str.reverse (#19117) @brandon-b-miller
Support cudf-polars str.pad_end and str.pad_start (#19116) @brandon-b-miller
Support cudf-polars str.head and str.tail (#19115) @brandon-b-miller
Support cudf-polars str.to_titlecase (#19114) @brandon-b-miller
Add cudf/io/codec.hpp to expose compression/decompression APIs (#19113) @ttnghia
Support converting decimals to/from pylibcudf scalars (#19106) @Matt711
Support resource-constrained sort-merge inner join operation through left table partitioning (#19102) @shrshi
Filter Parquet row groups using row bounds (#19082) @mhaseeb123
Implement UDF Filters (#19070) @lamarrr
Move the remaining libcudf pieces to C++20 (#19065) @vuule
Allow using a stream per thread at runtime (#19051) @vyasr
Remove stacktrace retrieval code (#19048) @ttnghia
Compile libcudf using C++20 Standard (#19045) @vuule
String Transform Examples: Added Branching, Public API Versions, and Sampling (#19038) @lamarrr
Refactor JNI error handling (#18983) @ttnghia
Add basic Sink support for streaming cudf-polars executor (#18963) @rjzamora
Fix debug-build Failure in JIT Tests (#18939) @lamarrr
Add from_arrow factory methods for Scalar and DataType (#18938) @Matt711
Add pylibcudf.Column.from_arrow factory method (#18937) @Matt711
Add pylibcudf.Table.from_arrow factory method (#18936) @Matt711
Update nvCOMP adapter (#18931) @vuule
Create a pylibcudf Column from a iterable of python strings (#18916) @Matt711
Add CLI argument to enable OOM protection in PDS-H (#18914) @pentschev
Implement data page pruning using Parquet page index stats (#18873) @mhaseeb123
Null-handling for Transforms (#18845) @lamarrr
Implement row group pruning with dictionaries in experimental PQ reader (#18836) @mhaseeb123
Add support for parquet scan + count operation (#18463) @Matt711
Manage strings with NRT (#18453) @brandon-b-miller

🛠️ Improvements

Disable codecov comments (#19472) @bdice
[FEA] Remove CUDA JIT-Compatibility Checks & CCCL WARs (#19470) @lamarrr
Use libnvcomp conda package (#19439) @bdice
JNI Set RMM_LOG_LEVEL and RMM_LOG_ACTIVE_LEVEL to allow setting log level at compile time (#19435) @abellina
Use numba-cuda >=0.14.0,<0.15.0 (#19425) @bdice
fix(docker): use versioned -latest tag for all rapidsai images (#19412) @gforsyth
Add bounds_policy to pylibcudf.lists.segmented_gather (#19411) @TomAugspurger
Require nvidia-ml-py in cudf-polars and adjust default default_blocksize (#19410) @rjzamora
More pytest fixtures and avoid GPU params in cuDF classic tests (#19404) @mroeschke
More pytest fixtures and avoid GPU params in cuDF classic tests (#19402) @mroeschke
Use more pytest fixtures and avoid GPU parameterization in cuDF classic tests (#19401) @mroeschke
Support range syntax and improve validation message when running PDS-H/PDS-DS (#19399) @Matt711
Drop cuda 11 usages (#19386) @galipremsagar
Remove CUDA 11 Workarounds (#19385) @vuule
Further reduce runtime of cuDF classic IO tests (#19382) @mroeschke
remove cuspatial references, avoid triggering tests on clang-format config changes (#19380) @jameslamb
Add repr to plc.aggregation.Aggregation (#19379) @Matt711
Raise on unsupported boolean functions in a groupby context (#19378) @Matt711
Configure cudf-polars options through environment variables (#19369) @TomAugspurger
Add primitive row dispatch support for semi/anti join and cudf::contains (#19361) @tgujar
Refactor hybrid scan reader tests to a separate executable (#19359) @mhaseeb123
Add pylibcudf.Column.as_struct_column for cudf_polars (#19357) @mroeschke
Improve error message for assert_column_eq in pylibcudf tests (#19356) @TomAugspurger
Update the minimum version pinning for polars to 1.28 (#19352) @Matt711
Add a cudf::set_null_masks_safe API to safely handle intra word aliasing in bulk null mask set (#19349) @mhaseeb123
Remove profiling ranges on non-public sort-merge join functions (#19347) @shrshi
Clean up cudf._lib.strings_udf.pyx (#19335) @mroeschke
Add support for pandas-2.3.1 (#19334) @galipremsagar
Allow comparison binop to datetime.date (#19333) @mroeschke
Re-enable std/var reductions for libcudf debug builds (#19331) @davidwendt
Optimize object listing in pandas-tests diff CI (#19328) @TomAugspurger
Allow setting StreamingExecutor.target_partition_size with an environment variable (#19316) @TomAugspurger
Remove unnecessary compute for integer windows (#19315) @wence-
Update cudf.pandas test skips for pandas==2.3.1 (#19313) @TomAugspurger
Support Expr.str.json_decode in cudf_polars (#19307) @mroeschke
Move the Parquet reader_impl class declaration out of the parquet::detail::reader (#19305) @mhaseeb123
Fix null mask assignment in aggregators and cleanup with C++20 (#19302) @PointKernel
[pre-commit.ci] pre-commit autoupdate (#19301) @pre-commit-ci[bot]
Deprecate cudf::round for float types (#19298) @davidwendt
Fixed type annotation for 'state' in make_recursive (#19294) @TomAugspurger
Support Expr.str.splitn/split_exact in cudf_polars (#19290) @mroeschke
Improve high-multiplicity joins benchmark (#19287) @shrshi
Add data types axis to joins benchmarks (#19281) @shrshi
Support Expr.str.strip_prefix/suffix in cudf_polars (#19278) @mroeschke
Support Expr.str.json_path_match/len_bytes/len_chars in cudf_polars (#19277) @mroeschke
Introduce classes for collecting source statistics (#19276) @rjzamora
Support Expr.str.find & Expr.str.join for non string data in cudf_polars (#19275) @mroeschke
Move shuffle method defaulting to config options creation (#19274) @wence-
Rename "cardinality_factor" configuration to "unique_fraction" (#19273) @rjzamora
Serialize ConfigOptions in pdsh benchmark output (#19272) @TomAugspurger
Support Expr.str.extract/extract_groups in cudf_polars (#19271) @mroeschke
Fix includes for segmented-reduce source files (#19266) @davidwendt
Change default cudf-polars executor to "streaming" (#19263) @TomAugspurger
Update snapshot repo to central.sonatype.com (#19259) @pxLi
Raise NotImplementedError for LazyFrame.profile with the streaming executor (#19257) @TomAugspurger
Move ast expression function definitions to .cpp files (#19250) @davidwendt
Enable chunked reading of PQ sources with >2B rows (#19245) @mhaseeb123
Support str.count_matches and str.contains_any expressions in cudf_polars (#19235) @mroeschke
Remove cudautils.py (#19233) @mroeschke
Use CUDA 12.9 in Conda, Devcontainers, Spark, GHA, etc. (#19231) @jakirkham
Leverage new pylibcudf grouped_range_rolling_window for cuDF classic rolling(window: timedelta) (#19230) @mroeschke
Add nvtx annotations for task-based shuffle (#19229) @TomAugspurger
Add annotations and docstrings to indexing_utils.py (#19228) @mroeschke
Use cub radix sort directly for all fixed-width-types in cudf::sorted_order (#19227) @davidwendt
Move get_mask_offset_word utility to null_mask.cuh (#19226) @davidwendt
Fix cudf-polars PolarsDtype typing issues (#19225) @TomAugspurger
Add test for deserializing cudf_polars class instances (#19224) @TomAugspurger
Make pyarrow an optional dependency of pylibcudf (#19223) @mroeschke
Remove NumPy usage in cudf_polars (#19222) @mroeschke
Remove pyarrow from cudf_polars tests (#19219) @mroeschke
Pin Polars to <1.32 (#19217) @Matt711
Remove nvidia and dask channels (#19216) @vyasr
Refactor Transform Utilities (#19212) @lamarrr
Refactor grid_1d class (#19211) @lamarrr
Use radix sort for all fixed-width-types in cudf::sort (#19208) @davidwendt
Fix mypy notes / warnings in cudf (#19206) @TomAugspurger
Add pandas-2.3.0 support (#19202) @galipremsagar
Avoid pylibcudf.interop.to_arrow in DataFrame.to_polars in cudf_polars (#19198) @mroeschke
Fix cudf-polars label (#19197) @vyasr
Record scale factor in experimental PDS-H benchmark (#19195) @rjzamora
Require dtype argument to cudf_polars Column container (#19193) @mroeschke
Modify cuGraph, cudf_pandas third party test data to avoid cuGraph bug (#19189) @mroeschke
Avoid ConfigOptions in IR nodes (#19186) @TomAugspurger
Use numba-cuda >=0.14.0,<0.15.0 to get pynvjitlink by default. (#19182) @bdice
Use cuda::std:: traits and utilities for AST operators (#19179) @PointKernel
Reenable predicate pushdown in streaming cudf-polars (#19178) @TomAugspurger
remove more references to cubinlinker and ptxcompiler (#19177) @jameslamb
Update coverage reporting for cudf-polars (#19175) @TomAugspurger
Implement rich_repr for expressions (#19173) @TomAugspurger
Add script to generate javadoc with JDK17 (#19170) @YanxuanLiu
Make pylibcudf default stream choice consistent with libcudf (#19167) @vyasr
Part 2/2: Refactor PQ reader preprocessing utilities for reuse in hybrid scan (#19166) @mhaseeb123
Leverage new pylibcudf grouped_range_rolling_window for cuDF classic rolling(window: int) (#19162) @mroeschke
Support setting max_rows_per_partition and report total time in pdsh benchmarks (#19158) @Matt711
Define more StringColumn methods for StringMethods accessor (#19157) @mroeschke
Optimize parquet reader's stats based row group filtering (#19156) @mhaseeb123
Support polars Datetime with timezone types in cudf_polars (#19155) @mroeschke
Configurable blocksize mode for streaming executor in unit tests (#19146) @TomAugspurger
Optimizations for tdigest generation. (#19140) @nvdbaranec
Remove CUDA 11 from dependencies.yaml (#19139) @KyleFromNVIDIA
Use radix sort for float/double types (#19137) @davidwendt
Support radix sort for timestamp and duration types (#19136) @davidwendt
Used TypedDict for CachingVisitor.state (#19135) @TomAugspurger
Move Accessor implementation to their own directory (#19134) @mroeschke
Add benchmarks for sorting float and timestamp (#19133) @davidwendt
Enable using page mask in decompress_page_data in Parquet reader (#19132) @mhaseeb123
refactor(shellcheck): fix all shellcheck warnings/errors (#19129) @gforsyth
Remove pytest pin (#19127) @vyasr
Move pdsh utility functions/classes to a separate module (#19126) @Matt711
Use pylibcudf.Column.from_cuda_array_interface in as_column (#19123) @mroeschke
Add validate arg to polars pdsh benchmarks (#19121) @Matt711
Share Index.values with base implementation (#19112) @mroeschke
Use len instead of len(obj.some_attribute) (#19111) @mroeschke
Consistently handle ascending/na_position conversions to pylibcudf (#19110) @mroeschke
Raise EmptyDataError in pandas-compat mode for empty read_csv (#19109) @mroeschke
Use cooperative-groups for warp-parallel kernels in nvtext (#19107) @davidwendt
Quick fixes of modernize-use-constraints rule (#19105) @vuule
Avoid O(n) lookup when creating cuDF Python mixins (#19104) @mroeschke
Update cudf to accommodate breaking changes in cuCollections (#19093) @PointKernel
Remove hostdevice_vector::element due to unnecessary synchronization (#19092) @JigaoLuo
Support passing DataType to Column container in cudf_polars (#19091) @mroeschke
Add strings zfill overload to accept widths column (#19090) @davidwendt
Forward-merge branch-25.06 to branch-25.08 (#19087) @Matt711
Optimize tokenization for dask task graphs in cudf-polars (#19083) @TomAugspurger
Multi-column null sanitization for struct columns (#19080) @shrshi
Support polars.Expr.value_counts in cudf_polars (#19079) @mroeschke
Support polars.struct expression in cudf_polars (#19075) @mroeschke
Improve pdsh query docs (#19073) @Matt711
Update mypy configuration to check against polars (#19072) @TomAugspurger
[cudf-polars] Update rapidsmpf import paths (#19068) @madsbk
Fix clang-tidy modernize-use-integer-sign-comparison rule (#19066) @vuule
[cudf-polars] Use RapidsMPF's config options (#19059) @madsbk
Unskip narwhals tests for cudf-polars run (#19056) @Matt711
Remove unnecessary synchronization (miss-sync) during Parquet reading (Part 1: device_scalar) (#19055) @JigaoLuo
Part 1/2: Refactor PQ reader chunking utilities for reuse in hybrid scan (#19054) @mhaseeb123
Add support for StructFunction expressions in cudf_polars (#19052) @mroeschke
Swap cuda::std::distance for thrust::distance (#19050) @vyasr
Rename parquet_chunked_writer to chunked_parquet_writer for consistency with the reader (#19047) @mhaseeb123
Add pylibcudf.Scalar.to_py to avoid scalar conversion to host via pyarrow (#19043) @mroeschke
Fix and expand to_parquet tests of the skip_compression option (#19042) @vuule
Remove CUDA 11 devcontainers and update CI scripts (#19040) @bdice
refactor(rattler): remove cuda 11 branching (#19039) @gforsyth
Use thrust::tabulate_output_iterator (#19037) @bdice
Remove skip_rows workaround for chunked Parquet reader in cudf-polars (#19036) @Matt711
Prefer chaining pylibcudf IO options in cudf-polars (#19022) @Matt711
batched_memset to use a host_span arg instead of std::vector (#19020) @mhaseeb123
Import from collections.abc for consistent typing/runtime access (#19019) @mroeschke
Avoid using cudf module for type annotations (#19018) @mroeschke
Mark pandas unit test test_eval_no_support_column_name as xpassing (#19016) @mroeschke
Improving Parquet decode throughput for struct type columns (#19014) @shrshi
Unify Frame.split and DataFrame.scatter_by_map/partition_by_hash implementations (#19013) @mroeschke
Move IndexedFrame.memory_usage docstrings to DataFrame/Series, make RangeIndex methods consistent with base class (#19010) @mroeschke
Share DataFrame/Series.(de)serialize methods, implement to_dlpack directly on Frame (#19008) @mroeschke
Pin narwhals to 1.41 (#19007) @Matt711
Add year range check to cudf::strings::is_timestamp (#19006) @davidwendt
Add cudf::strings::contains_multiple to pylibcudf (#19003) @davidwendt
Avoid unnecessary partition step in streaming join (#19002) @rjzamora
Part 2/n: Use cooperative groups in PQ decoders (#18978) @mhaseeb123
Move libcudf copying benchmarks to nvbench (#18976) @davidwendt
Add lag/lead/bitwise/row_number aggregations to pylibcudf (#18975) @mroeschke
Switch to importing rather than cimporting datetime (#18974) @vyasr
stop uploading packages to downloads.rapids.ai (#18973) @jameslamb
Trace IR.do_evaluate in cudf_polars (#18970) @TomAugspurger
xfail more pandas unit tests that fail with cudf.pandas before execution instead of xfailing after execution (#18965) @mroeschke
Remove test checks that depend on the compression engine (#18960) @vuule
Use cooperative-groups for warp-parallel kernels in strings functions (#18959) @davidwendt
fetch code before running pull request labeler (#18958) @jameslamb
Use cooperative groups in parquet decoder kernels (#18954) @mhaseeb123
Add a DataType container in cudf_polars (#18953) @mroeschke
add 'rapids-init-pip' to test_cudf_polars_polars_tests.sh (#18951) @jameslamb
parameterized ucx / ucxx (#18949) @quasiben
Rework cudf::sorted_order implementation for faster compile (#18948) @davidwendt
Remove deprecated Series methods, isclose (#18947) @mroeschke
Remove deprecated groupby.collect (#18946) @mroeschke
Remove deprecated get_dummies(cats=, ...) (#18944) @mroeschke
Add .python_typecode and .typestr attributes to DataType (#18941) @Matt711
Remove deprecated APIs (#18933) @vuule
Remove cudf.Scalar (#18927) @mroeschke
Add #pragma once to prevent redundant includes and speed up compilation (#18925) @PointKernel
Bump polars version to <1.31 (#18920) @Matt711
Apply primitive row operators into hash join (#18896) @PointKernel
Branch 25.08 merge branch 25.06 (#18895) @vyasr
Remove deprecated cudf::io::host_buffer (#18881) @Matt711
Fix decompression scratch size in AUTO mode (#18878) @vuule
Apply linter suggestions to cuIO code (#18876) @vuule
xfail pandas unit tests that fail with cudf.pandas (#18872) @mroeschke
Branch 25.08 merge branch 25.06 (#18855) @vyasr
Add support for extended dtypes in cudf.pandas (#18832) @galipremsagar
Auto merge fix for branch-25.08 (#18824) @davidwendt
Forward-merge branch-25.06 to branch-25.08 (#18817) @Matt711
Forward-merge branch-25.06 to branch-25.08 (#18756) @Matt711
Fix auto merge conflict for branch-25.08 (#18733) @davidwendt
Forward-merge branch-25.06 to branch-25.08 (#18698) @Matt711
Fix merge conflict for auto-merger 25.06 to 25.08 (#18693) @davidwendt
Fix merge conflict: branch-25.06 into branch-25.08 (#18668) @davidwendt
Make cuda12 as JNI default (#18651) @pxLi
Forward-merge branch-25.06 into branch-25.08 (#18647) @bdice
Fix merge branch-25.06 into branch-25.08 (#18622) @davidwendt
Store polars Series instead of pyarrow Array in cudf_polars LiteralColumn expr (#18564) @mroeschke
Refactor strings split/record with whitespace logic (#18560) @davidwendt
Refactor hash join with multiset (#18021) @PointKernel

Published by AyodeAwe 10 months ago

https://github.com/rapidsai/cudf - [NIGHTLY] v25.10.00

🐛 Bug Fixes

Fix logic for number of unique values generated by data profile in benchmarks (#19540) @shrshi
Fix value counts expression when the column has nulls (#19524) @Matt711
Prefer Column.astype over plc.unary.cast in the fill null unary function expression (#19479) @Matt711
Fix missing return in StringFunction.Strptime strict=True path (#19464) @Matt711
Make dividing a boolean column return f64 dtype in cudf-polars (#19443) @Matt711
branch-25.10-merge-branch-25.08 (#19429) @davidwendt

🚀 New Features

Make nvCOMP ZLIB (de)compression available by default (#19528) @vuule
Add primitive row dispatch support for semi/anti join and cudf::contains (#19518) @PointKernel
Derive and use page mask at subpass level for chunked reads (#19515) @mhaseeb123
Implement top k expression in cudf-polars using cudf::top_k (#19431) @Matt711
[FEA] Add chunked Parquet sink support using the libcudf writer (#19015) @Matt711

🛠️ Improvements

Move timeout in cudf.pandas pandas unit tests script to ci script (#19542) @mroeschke
Get rid of CG logic in the mixed semi-join kernel (#19536) @PointKernel
Construct more cuDF classic Columns with pylibcudf instead of using Buffers (#19535) @mroeschke
Fix clang-tools version pinning (#19529) @wence-
Add cudf_polars unit test for `is_in([])` expr (#19525) @mroeschke
Expose nvtext::letter_type to python (#19520) @Matt711
Add missing import of pyarrow.parquet when reading specified row_groups. (#19509) @bdice
Don't run serial cudf_pandas tests when testing multiple pandas versions (#19507) @mroeschke
Add nvtx ranges and minor fix for lists types in the next-gen parquet reader (#19493) @mhaseeb123
Move test_avro/test_api_types.py and some DataFrame tests to new cudf classic test directory structure (#19490) @mroeschke
Move test_series.py to new cudf classic test directory structure (#19485) @mroeschke
Move test_testing.py to new cudf classic test directory structure (#19481) @mroeschke
Allow latest OS in devcontainers (#19480) @bdice
Branch 25.10 merge branch 25.08 (#19475) @davidwendt
Improve readability when printing pylibcudf enums (#19451) @Matt711
Use more pytest fixtures and avoid GPU parameterization in cuDF classic tests (#19450) @mroeschke
Update build infra to support new branching strategy (#19445) @robertmaynard
Use more pytest fixtures and avoid GPU parameterization in test_indexing/joining/monotonic/multiindex.py (#19437) @mroeschke
Use more pytest fixtures and avoid GPU parameterization in cuDF classic tests (#19436) @mroeschke
Update s3 Bucket fixture creation in test_s3 (#19424) @mroeschke
Use more pytest fixtures and avoid GPU parameterization in cuDF classic tests (#19419) @mroeschke
Use GCC 14 in conda builds. (#19192) @vyasr

Published by rapids-bot[bot] 10 months ago

https://github.com/rapidsai/cudf - v25.06.00

🚨 Breaking Changes

Remove cudf.BaseIndex (#18751) @mroeschke
Implement BIT_COUNT unary operation (#18589) @ttnghia
Expose column chunk metadata in read_parquet_metadata() (#18579) @mhaseeb123
Fix overflow for MERGE_M2 groupby aggregation (#18546) @ttnghia
Deduplicate parquet physical type enums (#18526) @mhaseeb123
Implemented String Output & User-data Support for Transforms (#18490) @lamarrr
Promote Parquet type enums to enum classes (#18441) @mhaseeb123
Move parquet schema types and structs to public headers (#18424) @mhaseeb123
Start removal of vector factories with _sync suffix by deprecating them and adding versions without the suffix (#18414) @vuule
Skip decoding of pages marked as pruned in PQ reader (#18347) @mhaseeb123
Deprecate nvtext subword tokenizer (#18334) @davidwendt
Add standard data ingestion pipelines to pylibcudf for ndarrays (#18311) @Matt711
Remove extraneous modules from top level cudf namespace (#18287) @mroeschke
Add Keep Option Parameter to Distinct (#18237) @warrickhe
Update to CCCL 2.8.x with no CCCL patches (#18235) @bdice

🐛 Bug Fixes

Disable pytest benchmark for Narwhals CI job (#19074) @Matt711
Avoid undefined behaviour in rolling_store_output_functor (#19069) @wence-
Filter out pkg_resources UserWarning to make nightly CI pass (#19058) @Matt711
Pin deltalake to <1.0.0 (#19017) @Matt711
[BUG] Incorrectly getting the caller's frame when searching for locals and globals in cudf.pandas (#18979) @Matt711
Ensure gc fixture is used in custreamz test (#18915) @TomAugspurger
Fix a potential segfault in PQ reader's number of rows per source calculation (#18906) @mhaseeb123
Fix Dataframe getitem when MultiIndex columns exist (#18880) @galipremsagar
Ensure eq/ne between Columns in public objects don't return bool (#18875) @mroeschke
Fix fencepost error in Repartition task generation (#18854) @wence-
Fix cudf_polars pl.col(...).len() always excluding null values (#18849) @mroeschke
Throw a descriptive exception in Parquet reader when trying to read files with more than two billion rows (#18835) @mhaseeb123
Skip a decompression test (#18825) @vuule
Update strings benchmarks to use alloc_size column/table function (#18822) @davidwendt
Fix host decompression of empty DEFLATE data (#18805) @vuule
Avoid going OOM in test_row_limit_exceed_raises by using dummy array (#18802) @Matt711
Fix host decompression of empty Snappy data (#18800) @vuule
Skip test that fails due to polars issue (#18787) @wence-
Ensure scalar dtype is always set in from_py (#18780) @vyasr
Fix reading of Snappy compressed Avro files (#18774) @vuule
Fix missing semicolon in label_bins.cu (#18765) @evanramos-nvidia
Fix noexcept annotations on strings_column_view (#18763) @wence-
Fix integer overflows in pylibcudf from_column_view_of_arbitrary (#18758) @wence-
Fix overflow case and clean up some logic (#18734) @vyasr
Link to nvtx3::nvtx3-cpp instead of nvToolsExt (#18730) @jakirkham
Revise DaskIntegration protocol to align with rapidsmpf (#18720) @rjzamora
Fix skip_compression option in the Parquet writer with host compression (#18714) @vuule
Add missing header (#18671) @vyasr
Revert "Set flag to always use unsafe atomic storage" (#18657) @PointKernel
Fix optional operator* called on a disengaged value in clamp.cu (#18655) @davidwendt
Add missing header to host_memory.cpp (#18649) @alliepiper
Fix device compression when writing Parquet files without using nvCOMP (#18644) @vuule
Add CUDA_ARCHITECTURES setting to cpp-linters script (#18637) @davidwendt
Pin to cython<3.1 (#18617) @wence-
Fix DataFrame.memory_usage output order (#18595) @mroeschke
Set flag to always use unsafe atomic storage (#18590) @PointKernel
Update KvikIO S3 endpoint usage (#18565) @kingcrimsontianyu
Skip cuml third-party integration tests that may segfault (#18561) @Matt711
Allow .iloc with cuDF objects as column indexers (#18558) @mroeschke
Fix overflow for MERGE_M2 groupby aggregation (#18546) @ttnghia
Add back cudf root (#18544) @vyasr
Change default memory resource for 'distributed' cudf-polars (#18531) @rjzamora
Fix copy-on-write buffer separation and cleanup (#18530) @galipremsagar
Fix cpp examples cmake to use the rapids_config.cmake (#18501) @davidwendt
Rename rapidsmp to rapidsmpf (#18493) @rjzamora
Fix compilation with the C++20 standard (#18486) @vuule
Fix an error when reading some compressed Parquet V2 files (#18478) @vuule
Support title-case characters in strings capitalize() and title() APIs (#18457) @davidwendt
Ensure DataFrame column label operations reset label_dtype (#18452) @mroeschke
Fix a segfault when reading a Parquet file with unsupported compression type (#18451) @vuule
Fix logger macros (#18444) @vyasr
Fix auto-detection of compression type in host-side decompression (#18440) @shrshi
Use delete not free to release data allocated with new (#18412) @wence-
Fix synchronization issues in host compression and decompression (#18395) @vuule
Update Dask array-conversion handling (#18382) @rjzamora
Fixed indexing on empty DataFrame with no columns (#18381) @TomAugspurger
Deterministic hashing for DataFrameScan nodes in cudf-polars multi-partition executor (#18351) @TomAugspurger
Fix index of right table in unary operators in AST, in Joins (#18333) @karthikeyann
Add offsetalator to contiguous-split (#18312) @davidwendt
Support large strings in nvtext vocabulary-tokenizer (#18283) @davidwendt
Handle empty aggregations in multi-partition cudf.polars group_by (#18277) @TomAugspurger

📖 Documentation

Docs for streaming executor options (#18934) @quasiben
Fix some duplicate toctree issues and improve groupby docs (#18580) @vyasr
[DOC] Running libcudf benchmarks and comparing output results (#18548) @Matt711
Fix doxygen usage of the contraction for it is (#18517) @davidwendt
Clarify @brief tag as description/title on documentation guide (#18515) @davidwendt
[DOC] Improve clarity in parquet APIs set_row_groups and set_columns parquet (#18466) @Matt711
Add a usage page to cudf-polars documentation (#18460) @Matt711
[DOC] Fix typo in CONTRIBUTING.md on build type tests (#18456) @JigaoLuo
improve docs related to documentation contribution (#18418) @ncclementi
Add restart kernel note in cudf pandas docs (#18374) @ncclementi

🚀 New Features

Add CLI argument to enable RMM async memory resource in PDS-H (#18899) @pentschev
Scan a headerless CSV file with column names provided (#18816) @Matt711
Add fast paths for DataFrame.to_cupy (#18801) @Matt711
Require numba-cuda>=0.11.0 (#18770) @brandon-b-miller
Create a pylibcudf Column from a python iterable (#18768) @Matt711
Support ConditionalJoin via broadcasting in cudf-polars streaming engine (#18723) @rjzamora
Experimental PQ reader utility to calculate total rows in input row groups (#18716) @mhaseeb123
Extend explain_query to support printing the logical plan (pre lowered plan) (#18708) @Matt711
Reuse libcudf dependencies for Java JNI build when they are available (#18682) @ttnghia
Add alloc_size member function to cudf::column and cudf::table (#18639) @davidwendt
Print the physical cudf-polars plan in pdsh.py (#18635) @rjzamora
String Transform Examples (#18616) @lamarrr
Add streaming support for group_by -> n_unique to cudf-polars (#18606) @rjzamora
Export cudf compiler flags and definitions (#18604) @ttnghia
Implement BIT_COUNT unary operation (#18589) @ttnghia
Expose column chunk metadata in read_parquet_metadata() (#18579) @mhaseeb123
Add APIs to check ORC and Parquet compression support at runtime (#18578) @vuule
Add Distinct support to the cudf-polars streaming executor (#18576) @rjzamora
Add support for large list host Arrow data conversion (#18562) @vyasr
Implement BITWISE_AGG aggregations (bitwise AND, OR and XOR) for sort-based groupby and reduction (#18551) @ttnghia
Implement row group pruning with bloom filters in experimental PQ reader (#18545) @mhaseeb123
Implement row group pruning with stats in experimental PQ reader (#18543) @mhaseeb123
[JNI] Expose row-wise sha1 api (#18540) @warrickhe
Add Sort + head/tail support to streaming cudf-polars executor (#18538) @rjzamora
Add multi-partition MapFunction support to cudf-polars (#18523) @rjzamora
Adds support for writing raw UTF-8 characters (without escaping) in the JSON writer (#18508) @Matt711
Support reading from device buffers in the pylibcudf IO APIs (#18496) @Matt711
Support multi-partition Select operations with aggregations (#18492) @rjzamora
Implemented String Output & User-data Support for Transforms (#18490) @lamarrr
Add a utility to bulk set multiple null masks (#18489) @mhaseeb123
High level interface for experimental PQ reader and implementation of metadata APIs (#18480) @mhaseeb123
Added pylibcudf.utilities.is_ptds_enabled (#18467) @TomAugspurger
Add a public API for copying a table_view to device array (#18450) @Matt711
Support cudf-polars cast_time_unit (#18442) @brandon-b-miller
Support creating a pylibcudf Column from a host array (#18425) @Matt711
Move parquet schema types and structs to public headers (#18424) @mhaseeb123
Add optional dtype argument to Scalar.from_any (#18415) @Matt711
Expose cudf::chunked_pack in pylibcudf (#18411) @wence-
Add support for long string columns in cudf::contiguous_split (#18393) @nvdbaranec
Implemented String Input support for Transforms and Removed jit::column_device_view (#18378) @lamarrr
Automatically dispatch between host and device decompression/compression based on the number of buffers (#18363) @vuule
Expose join hash table load factor (#18361) @PointKernel
Skip decoding of pages marked as pruned in PQ reader (#18347) @mhaseeb123
Sort-based inner join for high-multiplicity tables (#18318) @shrshi
Support constructing pylibcudf Columns and Tables from views into arbitrary objects (#18314) @vyasr
Add standard data ingestion pipelines to pylibcudf for ndarrays (#18311) @Matt711
Support cudf-polars isoyear and week (isoweek) (#18265) @brandon-b-miller
Add Keep Option Parameter to Distinct (#18237) @warrickhe
Add rapidsmp shuffle support to cudf-polars (#18231) @rjzamora
Support cudf-polars strftime (#18181) @brandon-b-miller
Add benchmark for join operations with low build table cardinality (#18105) @shrshi
Add nvtext substring deduplication APIs (Part 2) (#18104) @davidwendt
Support include_file_paths in cudf polars (#18057) @Matt711
Add support for the Arrow device capsule interfaces (#15370) @vyasr

🛠️ Improvements

use 'rapids-init-pip' in wheel CI, other CI changes (#18902) @jameslamb
Avoid RecursionError in custreamz test (#18887) @TomAugspurger
Update NumPy dependency in cudf.pandas-catboost integration test (#18870) @Matt711
CPU only execution for PDSH (#18869) @quasiben
Remove more top level cudf imports in core (#18862) @mroeschke
Remove top level cudf imports in core (#18857) @mroeschke
Add CUDF_INSTALL_DIR for JAVA build script (#18852) @pxLi
Call the correct from_pandas in hdf reader (#18850) @galipremsagar
Update __all__ in cudf_polars/dsl/ir.py (#18848) @Matt711
Upload examples conda package (#18847) @vyasr
Add retries to prevent failures in occasionally slow CI runs (#18843) @galipremsagar
Finish CUDA 12.9 migration and use branch-25.06 workflows (#18839) @bdice
Remove toplevel import cudf from window/tools/join directories (#18833) @mroeschke
Remove toplevel import cudf from cudf/io files (#18829) @mroeschke
Update pdsh benchmark script to support explain-only (#18826) @TomAugspurger
Refactor UDF utils and add a hook to enable NRT when necessary (#18823) @brandon-b-miller
Fix memory access error in nvtext::edit_distance (#18821) @davidwendt
Update to clang 20 (#18818) @bdice
Reduce more data sizes of Python tests (#18814) @mroeschke
Mark DataFrame.dtypes as an _external_only_api (#18809) @mroeschke
Change calls to thrust::swap to cuda::std::swap (#18808) @davidwendt
Move implemented BaseIndex methods over to Index (#18807) @mroeschke
Improve pandas version fetching script (#18793) @galipremsagar
Change cudf::sort googlebench benchmarks to nvbench (#18786) @davidwendt
Only warn in cudf.pandas if rmm mode explicitly set and rmm already configured (#18785) @jcrist
Quote head_rev in conda recipes (#18784) @bdice
Move RangeIndex implementation below Index (#18777) @mroeschke
Remove unnecessary _Ravelled class (#18771) @Matt711
Remove pytest-rerunfailures (#18766) @mroeschke
Replace from_arrow with direct calls Column/Table constructors in pylibcudf and cudf-polars tests (#18762) @Matt711
CUDA 12.9 use updated compression flags (#18755) @robertmaynard
fix(rattler): add librmm to host for libcudf to fix overlinking error (#18754) @gforsyth
Remove the file name from the output in cudf-polars' explain APIs (#18752) @Matt711
Remove cudf.BaseIndex (#18751) @mroeschke
Support creating a pylibcudf Column from a general ndarray (#18744) @Matt711
Improve lowering of Distinct IR nodes for high-cardinality data (#18725) @rjzamora
Simplify Numba-CUDA MVC logic (#18724) @bdice
Test with CUDA 12.9.0 (#18721) @bdice
Add more cudf.Series microbenchmarks (#18718) @Matt711
Run unit-tests-cudf-pandas on branch-25.06 for nightly tests (#18717) @davidwendt
Move test_large_unique_categories_repr to benchmarks (#18715) @galipremsagar
Allow pylibcudf.Column to consume objects exposing __arrow_c_stream__ (#18712) @mroeschke
Switch from printing to logging (#18711) @vyasr
Add Python tests for different compression implementations (#18710) @vuule
Remove redundant xfails in cuml integration tests (#18699) @Matt711
ci: run unit-tests-cudf-pandas on branch-25.06 workflow (#18692) @gforsyth
Exclude librmm.so from auditwheel (#18691) @bdice
Add C++ tests for different compression implementations (#18690) @vuule
Improve runtime of cuDF Python unit tests (#18689) @mroeschke
Require at least numba-cuda 0.10.1 (#18688) @brandon-b-miller
Add nvidia-cuda-{nvrtc, nvcc} as a dependency for cuDF wheels (#18686) @brandon-b-miller
Support rolling aggregations in in-memory cudf-polars execution (#18681) @wence-
Replace parquet_blocksize with target_partition_size (#18669) @rjzamora
Skip test_large_unique_categories_repr in CI (#18666) @bdice
Locally import pyarrow.dataset and fsspec for import cudf performance (#18663) @mroeschke
Disable arm64 python tests (#18662) @galipremsagar
Pin numba-cuda>=0.9.0,!=0.10.0 due to CI hangs on ARM (#18661) @mroeschke
Fix compile warnings in Java JNI (#18660) @ttnghia
Drop Empty nodes from IR graph (#18658) @rjzamora
Add support for Python 3.13 (#18648) @gforsyth
Cleanup libcudf detail/aggregation.hpp/.cuh (#18642) @davidwendt
Skip all known pytest failures in pandas-tests (#18641) @galipremsagar
Preserve partitioning after Filter and Projection in cudf-polars (#18638) @rjzamora
Support quantile in cudf-polars grouped aggregations (#18634) @wence-
Deprecate Series.nullmask, Series.nullable, Series.from_categorical, Series.from_masked_array, cudf.isclose (#18631) @mroeschke
Access private objects by importing from module instead of cudf.core/util namespace (#18629) @mroeschke
Replace unnecessary cudf::size_of() calls with sizeof() (#18628) @davidwendt
Improve cold cache dropping (#18626) @kingcrimsontianyu
Improve default config values for cudf-polars streaming (#18623) @rjzamora
Add gtest error check for nvtext::wordpiece_tokenize (#18621) @davidwendt
Polars dataframe serialize using chunked pack (#18614) @madsbk
xfail all known errors in pandas-test suite (#18612) @galipremsagar
Add TemporalBaseColumn as a parent class to DatetimeColumn and TimedeltaColumn (#18611) @mroeschke
Update cudf::cast internal function to use sizeof instead of cudf::size_of (#18607) @davidwendt
Move cudf/utils/utils.py methods to appropriate locations (#18605) @mroeschke
pylibcudf.Column: add device_buffer_size and register a dask.sizeof function for cudf-polars Column and DataFrame (#18602) @madsbk
Use cached_property for Datetime and Timedelta column properties (#18601) @mroeschke
Annotate and simplify from_arrow (#18600) @mroeschke
Enable reporting peak memory usage for gtests (#18599) @davidwendt
Prune methods from Frame that are specific to subclasses (#18597) @mroeschke
Switch tensorflow integration tests to use 12.x (#18596) @galipremsagar
refactor: use libnvcomp from libkvikio wheel to unblock Python 3.13 upgrade (#18593) @gforsyth
Add temporary pdsh benchmarks to cudf_polars.experimental (#18592) @rjzamora
Update numba-cuda dependency to >=0.9.0 (#18591) @brandon-b-miller
use 'certifi' certificates in fetch_pandas_versions script (#18588) @jameslamb
Add nvtext substring duplication APIs (Part 1) (#18585) @davidwendt
Bump polars version to <1.29 (#18581) @Matt711
Allow datetime.timedelta objects in pylibcudf.Scalar.from_py (#18577) @mroeschke
Rework strings split_helper utility for better reuse (#18575) @davidwendt
Additional tests for strings split APIs (#18574) @davidwendt
Support datetime.datetime objects in pylibcudf.Scalar.from_py (#18572) @mroeschke
Store Python scalars instead of PyArrow Scalars in cudf_polars Literal expr (#18563) @mroeschke
Support plc.Scalar.from_py(None) and plc.Scalar.from_py(int, float type) (#18559) @mroeschke
Add xfail window function tests for cudf_polars (#18557) @btepera
Add fast paths to Series.to_cupy and Series.values (#18555) @Matt711
Reduce cudf-polars pyarrow usage (#18554) @vyasr
Avoid possible invalid kernel grid error in cudf::set_null_masks if no bitmasks to set (#18553) @mhaseeb123
Adjust cudf Python groupby test for cuCollections update (#18550) @mroeschke
Refactor scan test I/O logic into shared make_partitioned_source helper (#18542) @Matt711
Download build artifacts from Github for CI jobs (#18539) @VenkateshJaya
Update hypothesis version (#18537) @galipremsagar
Make Python testing dependencies more specific to pylibcudf vs cudf (#18535) @mroeschke
Pin hypothesis<6.131.1 due to performance issues (#18532) @mroeschke
Deduplicate parquet physical type enums (#18526) @mhaseeb123
Reduce the number of miscellaneous pandas unit tests run with cudf.pandas (#18524) @mroeschke
Improve nvtext::tokenize_with_vocabulary performance (#18522) @davidwendt
Make pylibcudf.Column.from_rmm_buffer a Python staticmethod (#18521) @mroeschke
Add more short circuit checks for .equals (#18520) @mroeschke
Add synchronous task scheduler to cudf-polars (#18519) @rjzamora
Don't fetch dlpack headers when building cuDF Python (#18518) @mroeschke
Refactor polars configuration (#18516) @TomAugspurger
Refactor internal strings utility to separate header and definition file (#18514) @davidwendt
Fix print() keyword argument in cudf pandas test (#18513) @trxcllnt
Improve performance of strings split-record on whitespace (#18510) @davidwendt
Use cuda::std::iter_value_t instead of thrust iterator traits (#18509) @miscco
Remove redundant task-graph logic for streaming GroupBy (#18507) @rjzamora
Replace GPU_ARCHS build variable by CMAKE_CUDA_ARCHITECTURES (#18506) @ttnghia
Optimize pandas metadata generation to reduce memory pressure (#18505) @galipremsagar
Replace deprecated host_buffer with host_span in SourceInfo (#18503) @Matt711
Add pylibcudf.Column.from_rmm_buffer (#18502) @mroeschke
Replace thrust functors with libcu++ ones (#18500) @miscco
Rename cudf-polars executors (#18499) @rjzamora
Remove casting functions in pylibcudf utils (#18497) @Matt711
Increase wheel size limit. (#18487) @bdice
Add CategoricalIndex.from_codes (#18485) @mroeschke
Split join header (#18484) @shrshi
Fix unspecified behavior involving move semantics and order of evaluation (#18481) @kingcrimsontianyu
Remove need for to_cudf_compatible_scalar (#18477) @mroeschke
Rerun flaky pytests in CI (#18476) @galipremsagar
Vendor RAPIDS.cmake (#18473) @bdice
Add ARM conda environments. (#18470) @bdice
Bump polars version to <1.28 (#18469) @Matt711
Add sink support in cudf_polars (#18468) @mroeschke
Enable rapidsmpf spilling in cudf-polars (#18461) @madsbk
Promote Parquet type enums to enum classes (#18441) @mhaseeb123
Consolidate logic in DataFrame.__init__ for listlike arguments (#18439) @mroeschke
Update compression formats supported in JSON reader (#18438) @shrshi
Disabled Jitify Minification (#18436) @lamarrr
Fix printing decimal128 types that are zero (#18435) @trxcllnt
Replace direct use of nvCOMP and of its adapter with the higher-level decompression API (#18434) @vuule
Add more cudf.DataFrame constructor pytest benchmarks (#18433) @mroeschke
Test against stable tags for narwhals (#18431) @Matt711
Refcount-based dropping of cached evaluations in cudf-polars executor (#18430) @wence-
Replace Thrust iterator facilities with libcu++ ones (#18427) @miscco
Remove numpy requirement when converting 2d cuda array interface objects to pylibcudf Columns (#18426) @Matt711
Share more cudf.Column methods for indices_of/isin (#18423) @mroeschke
Switch the ptr type in gpumemoryview from Py_ssize_t to uintptr_t (#18419) @Matt711
Add strings::extract_single API (#18417) @davidwendt
Add to_arrow_host_stringview interop API (#18416) @davidwendt
Start removal of vector factories with _sync suffix by deprecating them and adding versions without the suffix (#18414) @vuule
Allow polars arrow conversion to produce string_view (#18413) @wence-
Change dask_cudf.to_parquet behavior for local filesystems (#18408) @rjzamora
Add rank and label_bin methods to ColumnBase (#18407) @mroeschke
Improve performance of strings::like for long strings (#18406) @davidwendt
Automatic single-partition fallback in cudf-polars (#18405) @rjzamora
Remove _sync suffix from hostdevice types (#18404) @vuule
Use owning Arrow types in C++ to expose data to Python (#18402) @vyasr
add static push and pop methods to NvtxRange (#18401) @zpuller
Deprecate cudf.Scalar (#18394) @mroeschke
Bump polars version to <1.27 (#18387) @Matt711
Branch 25.06 merge 25.04 (#18380) @Matt711
Silence warning by setting BUILD_SHARED_LIBS (#18371) @vyasr
Rewrite groupby aggregations in cudf-polars to simplify evaluation (#18369) @wence-
Pass stream through when taking ownership from libcudf (#18367) @wence-
Expose new grouped_range_rolling API in pylibcudf (#18365) @wence-
Avoid patching sort algorithms from CCCL (#18364) @miscco
Deprecate old nvtext::normalize_characters (#18360) @davidwendt
refactor(rattler): enable strict channel priority for builds (#18358) @gforsyth
Optimize sequences by introducing make_offsets_child_column (#18357) @ustcfy
Decompress all data in a single decompress_page_data when reading Parquet input in a single chunk (#18352) @vuule
Moving wheel builds to specified location and uploading build artifacts to Github (#18346) @VenkateshJaya
Performance improvement for tolower/toupper for multi-byte UTF-8 characters (#18345) @davidwendt
Branch 25.06 merge branch 25.04 (#18344) @vyasr
Use dask-cuda for cudf-polars experimental testing (#18343) @rjzamora
Deprecate nvtext subword tokenizer (#18334) @davidwendt
Remove cudf.Scalar in as_column (#18331) @mroeschke
Add tests for cudf.polars to be able to work on a cpu-only machine (#18327) @galipremsagar
Allow cudf.DataFrame.from_pylibcudf to accept a pylibcudf.io.TableWithMetadata (#18319) @mroeschke
Avoid stateful construction in DataFrame.__init__ (#18306) @mroeschke
Improve the groupby performance for extremely low cardinality (#18290) @PointKernel
Remove extraneous modules from top level cudf namespace (#18287) @mroeschke
Require type annotations in cudf.polars (#18285) @TomAugspurger
Removing unnecessary StreamSynchronization in reading (#18279) @JigaoLuo
Update to CCCL 2.8.x with no CCCL patches (#18235) @bdice
Reduce register pressure for compute_column_kernel (#18226) @matal-nvidia
Use the mapped buffer for all read operations in the memory-mapped source; switch default source to the kvikIO one (#18204) @vuule
Improve test coverage in the catboost integration tests (#18126) @Matt711
Create file sources in parallel (#18094) @vuule
Enable stumpy_distributed tests (#17969) @galipremsagar
Refactor distinct join to use primitive row operators when proper (#17726) @PointKernel
Update chunked parquet reader benchmarks (#16543) @sdrp713

- C++
Published by raydouglass 12 months ago

https://github.com/rapidsai/cudf - [NIGHTLY] v25.08.00

🚨 Breaking Changes

Remove deprecated Series methods, isclose (#18947) @mroeschke
Remove deprecated groupby.collect (#18946) @mroeschke
Remove deprecated get_dummies(cats=, ...) (#18944) @mroeschke
Remove cudf.Scalar (#18927) @mroeschke
Remove deprecated cudf::io::host_buffer (#18881) @Matt711

🐛 Bug Fixes

Fix flaky custreamz test (#18961) @TomAugspurger

📖 Documentation

Update cuDF Python library design with BaseIndex and pylibcudf updates (#18903) @mroeschke

🚀 New Features

Add CLI argument to enable OOM protection in PDS-H (#18914) @pentschev

🛠️ Improvements

add 'rapids-init-pip' to test_cudf_polars_polars_tests.sh (#18951) @jameslamb
parameterized ucx / ucxx (#18949) @quasiben
Remove deprecated Series methods, isclose (#18947) @mroeschke
Remove deprecated groupby.collect (#18946) @mroeschke
Remove deprecated get_dummies(cats=, ...) (#18944) @mroeschke
Add .python_typecode and .typestr attributes to DataType (#18941) @Matt711
Remove cudf.Scalar (#18927) @mroeschke
Add #pragma once to prevent redundant includes and speed up compilation (#18925) @PointKernel
Branch 25.08 merge branch 25.06 (#18895) @vyasr
Remove deprecated cudf::io::host_buffer (#18881) @Matt711
Apply linter suggestions to cuIO code (#18876) @vuule
xfail pandas unit tests that fail with cudf.pandas (#18872) @mroeschke
Branch 25.08 merge branch 25.06 (#18855) @vyasr
Auto merge fix for branch-25.08 (#18824) @davidwendt
Forward-merge branch-25.06 to branch-25.08 (#18817) @Matt711
Forward-merge branch-25.06 to branch-25.08 (#18756) @Matt711
Fix auto merge conflict for branch-25.08 (#18733) @davidwendt
Forward-merge branch-25.06 to branch-25.08 (#18698) @Matt711
Fix merge conflict for auto-merger 25.06 to 25.08 (#18693) @davidwendt
Fix merge conflict: branch-25.06 into branch-25.08 (#18668) @davidwendt
Make cuda12 as JNI default (#18651) @pxLi
Forward-merge branch-25.06 into branch-25.08 (#18647) @bdice
Fix merge branch-25.06 into branch-25.08 (#18622) @davidwendt

- C++
Published by rapids-bot[bot] about 1 year ago

https://github.com/rapidsai/cudf - v25.04.00

🚨 Breaking Changes

Remove unused group_range_rolling_window API (#18313) @wence-
[BUG] Disabled JIT for CUDA Runtime < 11.5 (#18296) @lamarrr
Remove cudf.Scalar from binops (#18240) @mroeschke
Enforce deprecation of dtype parameter in sum/product (#18070) @mroeschke
Remove deprecated single component datetime extract APIs (#18010) @Matt711
Remove deprecated rolling window functionality (#17993) @wence-
Remove deprecated nvtext::minhash_permuted APIs (#17939) @davidwendt
Remove dataframe protocol (#17909) @vyasr
Use new rapids-logger library (#17899) @vyasr
Added Multi-input & Scalar Support for Transform UDFs (#17881) @lamarrr
Fixed incorrect PTX parsing of ret instruction after branch label (#17859) @lamarrr
Use KvikIO to enable file's fast host read and host write (#17764) @kingcrimsontianyu

🐛 Bug Fixes

Fix alpha versions of cudf package. (#18429) @bdice
Backport: Deterministic hashing for DataFrameScan nodes in cudf-polars multi-partition executor (#18351) (#18420) @bdice
Skip failing Narwhals rolling groupby tests (#18398) @Matt711
Pin cmake in test_java to be less than 4.0.0 (#18392) @abellina
Skip polars tests that fail with pydantic deprecation warnings (#18388) @Matt711
Backport: Fix index of right table in unary operators in AST, in Joins (#18342) @bdice
xfail narwhals sqlframe tests (#18297) @Matt711
[BUG] Disabled JIT for CUDA Runtime < 11.5 (#18296) @lamarrr
Make a pylibcudf Column from a device array object with strides=None (#18295) @Matt711
Fix cudf.pandas objects to not be Callable (#18288) @galipremsagar
Skip failing polars test test_general_prefiltering (#18264) @Matt711
Filter all cudf.pandas profiler tests from running in parallel (#18262) @Matt711
Allow cudf.Series([pd.NA], dtype=, nan_as_null=False) (#18259) @mroeschke
Fix cross join with extra columns (#18256) @galipremsagar
Fix Dataframe.loc to not modify the actual dataframe (#18254) @galipremsagar
Remove RMM macro usage from to_arrow_device.cu (#18252) @davidwendt
Skip Narwhals cross join tests for cudf.pandas CI run (#18249) @Matt711
Fix cudf-polars tests for polars < 1.24 (#18246) @wence-
Fix experimental cudf-polars tests (#18244) @rjzamora
Fix datetime64 vs datetime binops max resolution (#18241) @galipremsagar
Use CCCL::libcudacxx include directories in Jitify preprocessing. (#18233) @bdice
Disable conda prefix patching to avoid mangling binaries (#18225) @vyasr
Workaround for ARM compiler issue with single space literal string (#18220) @davidwendt
Bump nightly check limit (#18213) @Matt711
Support comparative binops between categorical and non categorical (#18200) @mroeschke
Make the version file inside cudf.pandas not a symlink (#18198) @vyasr
Ensure RAPIDS_ARTIFACTS_DIR is set for build metrics reports. (#18192) @bdice
Ignore run exports of libcufile. (#18190) @bdice
Skip flaky multi GPU test (#18187) @Matt711
Fix BPE merges table static-map capacity size (#18184) @davidwendt
Drop CUB_QUOTIENT_CEILING (#18179) @miscco
Disable ARM CI in C++ and Python test CI jobs (#18175) @Matt711
Add fmt to the test/benchmarks env (#18173) @vyasr
Fix merge(how=left, left_on=, right_index=True, sort=True) (#18166) @mroeschke
Allow nonnative cupy dtype in cudf.Series (#18164) @mroeschke
Fix Series construction from numpy array with non-native byte order (#18151) @mroeschke
Use protocol for dlpack instead of deprecated function in cupy notebook (#18147) @Matt711
Skip failing test (#18146) @vyasr
Update calls to KvikIO's config setter (#18144) @kingcrimsontianyu
Reduce memory use when writing tables with very short columns to ORC (#18136) @vuule
Handle empty dictionary in to_arrow_device interop (#18121) @davidwendt
Allow pivot_table to accept single label index and column arguments (#18115) @mroeschke
Preserve DataFrame.column subclass and type during binop (#18113) @mroeschke
Fix rmm macro call (#18108) @pmattione-nvidia
Add include for <functional> (#18102) @miscco
Remove static column vectors from window function tests. (#18099) @mythrocks
Fix scatter_by_map with spilling enabled (#18095) @mroeschke
Use the right version macro CCCL_MAJOR_VERSION (#18073) @miscco
Fix test_scan_csv_multi cudf-polars test (#18064) @rjzamora
Fix memcopy direction for concatenate (#18058) @tgujar
Fix upstream dask loc test (#18045) @rjzamora
Fix hang on invalid UTF-8 data in string_view iterator (#18039) @davidwendt
Fix dask_cudf.to_orc deprecation (#18038) @rjzamora
Compatibility with dask.dataframe's is_scalar (#18030) @TomAugspurger
Fix the build error due to KvikIO update (#18025) @kingcrimsontianyu
Fix failing ibis test (#18022) @Matt711
Skip failing polars tests (#18015) @Matt711
Fix to_arrow to return consistent pandas-metadata (#18009) @galipremsagar
Prevent setting custom attributes to ColumnMethods (#18005) @galipremsagar
Compatibility with Dask main (#17992) @TomAugspurger
[Bug] Fix Parquet-metadata sampling in cudf-polars (#17991) @rjzamora
Add missing include for calling std::iota() (#17983) @davidwendt
Fix pickle and unpickling for all objects (#17980) @galipremsagar
Install duckdb, the default backend for ibis, in the cudf.pandas integration tests (#17972) @Matt711
Check null count too in sum aggregation (#17964) @Matt711
Raise NotImplementedError for groupby.agg if duplicate columns would be created (#17956) @mroeschke
Ensure disabling the module accelerator is thread-safe (#17955) @vyasr
Fix DataFrame/Series.rank for int and null data in mode.pandas_compatible (#17954) @mroeschke
Limit buffer size in reallocation policy in JSON reader (#17940) @shrshi
Make cudf.pandas proxy array picklable (#17929) @Matt711
Add missing standard includes (#17928) @miscco
Fix torch integration test (#17923) @Matt711
Fix to_pandas writable bug for datetime and timedelta types (#17913) @galipremsagar
Raise NotImplementedError if .merge(suffixes=) introduces duplicate labels (#17905) @mroeschke
Fix groupby scans with int and NA data in mode.pandas_compatible (#17895) @mroeschke
Patch __init__ of cudf constructors to parse through cudf.pandas proxy objects (#17878) @galipremsagar
Fixed incorrect PTX parsing of ret instruction after branch label (#17859) @lamarrr
Relax inconsistent schema handling in dask_cudf.read_parquet (#17554) @rjzamora

📖 Documentation

Clarify that cudf.pandas should be enabled before importing pandas. (#18339) @bdice
[DOC] Add wordpiece tokenizer to cudf documentation (#18247) @davidwendt
Added pylibcudf.contiguous_split to API docs (#18194) @TomAugspurger
Fix build.sh docs for default behavior (#18180) @bdice
Update Dask-cuDF documentation to fix all warnings and errors (#18157) @TomAugspurger
[DOC] Document character normalizer (#18125) @Matt711

🚀 New Features

Add and revise experimental cudf-polars config options (#18284) @rjzamora
Support top_k and bottom_k expressions (#18222) @Matt711
Support cudf-polars is_leap_year (#18212) @brandon-b-miller
Support cudf-polars month_start/month_end (#18211) @brandon-b-miller
Support cudf-polars ordinal_day (#18152) @brandon-b-miller
Add pylibcudf.gpumemoryview support for len()/nbytes (#18133) @pentschev
Link to libzstd for ZSTD compression and decompression APIs (#18129) @shrshi
Added NDSH Q09 Benchmark for Transforms (#18127) @lamarrr
Make pylibcudf traits raise exceptions gracefully rather than terminating in C++ (#18117) @Matt711
Host decompression (#18114) @vuule
Add owning types to hold Arrow data (#18084) @vyasr
Bump polars version to <1.24 (#18076) @Matt711
Support sorted merges in cudf.polars (#18075) @Matt711
Add a slice expression to polars IR (#18050) @Matt711
Expose num_rows_per_source (IO metadata) to pylibcudf (#18049) @Matt711
Added Imbalanced Tree Benchmarks for Transforms (#18032) @lamarrr
Run the narwhals test suite with cudf.pandas (#18031) @Matt711
Add host_read_async interfaces to datasource (#18018) @vuule
Make most cudf-polars Node objects pickleable (#17998) @rjzamora
Add Column.serialize to cudf-polars (#17990) @rjzamora
Bump polars version to <1.23 (#17986) @Matt711
Implemented Decimal Transforms (#17968) @lamarrr
Introduce ZSTD host-side compression and decompression APIs (#17935) @shrshi
Add catboost integration tests (#17931) @Matt711
[FEA] Expose stripe_size_rows setting for ORCWriterOptions (#17927) @ustcfy
Test narwhals in CI (#17884) @bdice
Added Multi-input & Scalar Support for Transform UDFs (#17881) @lamarrr
Host Snappy compression (#17824) @vuule
Run spark-rapids-jni CI (#17781) @KyleFromNVIDIA
Add multi-partition Shuffle operation to cuDF Polars (#17744) @rjzamora
Added polynomials benchmark (#17695) @lamarrr
Add stream parameters in pylibcudf IO APIs (#17620) @Matt711
New nvtext::wordpiece_tokenizer APIs (#17600) @davidwendt
Add support for unary negation operator (#17560) @Matt711
Add multi-partition Join support to cuDF-Polars (#17518) @rjzamora
Add basic multi-partition GroupBy support to cuDF-Polars (#17503) @rjzamora
Support Distributed in cudf-polars tests and IR evaluation (#17364) @pentschev

🛠️ Improvements

Use pyarrow 15 in oldest dependency CI jobs (#18409) @bdice
Bump librdkafka to 2.8.0 (#18370) @raydouglass
fix(rattler): ignore libzlib run dependency to avoid pandoc collision (#18368) @gforsyth
Fix zstd build interface include definition (#18366) @trxcllnt
test: Install pytest-env and hypothesis in test_narwhals.sh (#18337) @MarcoGorelli
Remove unused group_range_rolling_window API (#18313) @wence-
Cache column view creation from arrow types (#18302) @vyasr
Split Narwhals cudf.pandas tests failures into to fix and to skip (#18267) @mroeschke
Support BinOp, min, and max Aggregations in cudf-polars parallel groupby (#18266) @TomAugspurger
Minor clean up and optimizations in the Parquet writer (#18258) @vuule
Fix cudf_kafka run export for cudatoolkit (#18245) @gforsyth
dask-polars: use splat everywhere. (#18243) @madsbk
Remove cudf.Scalar from binops (#18240) @mroeschke
Remove warning in the stream pool when asking for more streams than available (#18236) @vuule
Explain why we disable parallelism for profiler tests to avoid pytest-cov issue (#18234) @Matt711
Ignore cudatoolkit run exports by name, not package (#18230) @gforsyth
Revert "Bump nightly check limit" (#18227) @Matt711
Fix cudf.pandas to be able to work on a cpu-only machine (#18224) @galipremsagar
Add missing cudatoolkit run_export ignore to pylibcudf (#18223) @gforsyth
Remove cudf.Scalar from Column.setitem (#18221) @mroeschke
Remove unused round_up_pow2 utility (#18218) @PointKernel
Add flake8-print/debugger Ruff rules (#18217) @mroeschke
Bump polars version to <1.25 (#18209) @Matt711
Export RAPIDS_ARTIFACTS_DIR. (#18208) @bdice
Drop more thrust functions with libcu++ ones (#18207) @miscco
Update Numpy <2.1 unpinning xfail condition (#18203) @mroeschke
Run conda import tests on Python packages (#18197) @bdice
fix(rattler): add cudatoolkit ignore run export to cudf (#18195) @gforsyth
Revert "Disable ARM CI in C++ and Python test CI jobs" (#18188) @Matt711
Define Column.where to be used across DataFrame/Series (#18186) @mroeschke
Remove cudf.Scalar in where (#18178) @mroeschke
Drop unnecessary fmt dep (#18177) @vyasr
Refactor join internals: separate hash_join declaration and cleanup (#18170) @PointKernel
Add Ruff rule to enforce cudf dtype utils over numpy/pandas dtype utils (#18169) @mroeschke
Combine multiple str.minhash() APIs into one call (#18168) @davidwendt
Move nanoarrow_utils.hpp from cpp/tests/interop to cpp/include/cudf_test (#18163) @davidwendt
Test cudf against the latest stable branch of Narwhals (#18162) @Matt711
fix libcudf pins cu11 (#18161) @gforsyth
Combine separate ConfigureNVBench calls to fix cpp conda builds (#18155) @gforsyth
Add telemetry to build workflows (#18154) @gforsyth
Prune more seldom used dtype utils (#18150) @mroeschke
Remove some unnecessary module imports (#18143) @mroeschke
Branch 25.04 merge branch 25.02 (#18142) @vyasr
Prune some seldom used dtype utils (#18141) @mroeschke
Use more, cheaper dtype checking utilities in cudf Python (#18139) @mroeschke
Support deserializing cudf-polars objects composed of RMM frames (#18138) @pentschev
Add ConfigOptions convenience class to cudf-polars (#18137) @rjzamora
Support new callback API for lazyframe.profile (#18132) @wence-
Optimized compilation of CUDFTESTUTIL's interface sources (#18131) @lamarrr
Unpin numpy<2.1 (#18128) @mroeschke
Use cpu16 for build CI jobs (#18124) @bdice
Remove now non-existent job (#18123) @vyasr
Minor typo fix in filling.pxd (#18120) @davidwendt
Replace more deprecated CUB functors (#18119) @miscco
Simplify DecimalDtype and DecimalColumn operations (#18111) @mroeschke
Add interop support from arrow StringView to libcudf strings column (#18107) @davidwendt
Expose the Number of Filtered Parquet Rowgroups (IO Metadata) to pylibcudf (#18106) @JigaoLuo
Add a list of expected failures to narwhals tests (#18097) @Matt711
Remove unused var (#18096) @vyasr
Run narwhals tests nightly. (#18093) @bdice
Use conda-build instead of conda-mambabuild (#18092) @bdice
Remove static configure step (#18091) @vyasr
Remove FindCUDAToolkit.cmake from .pre-commit-config.yaml (#18087) @KyleFromNVIDIA
Align StringColumn constructor with ColumnBase base class (#18086) @mroeschke
Remove FindCUDAToolkit backport (#18081) @KyleFromNVIDIA
Support melt(ignore_index=False) (#18080) @mroeschke
Update numba dep and upper-bound numpy (#18078) @vyasr
Add as_proxy_object API to cudf.pandas (#18072) @galipremsagar
Enforce deprecation of dtype parameter in sum/product (#18070) @mroeschke
send sccache logs to telemetry (#18069) @msarahan
Short circuit Index.equal if compared Index isn't same type (#18067) @mroeschke
Make Column.view/can_cast_safely accept a dtype object (#18066) @mroeschke
Optimization improvement for substr in cudf::string_view (#18062) @davidwendt
Forward-merge branch-25.02 to branch-25.04 (#18061) @bdice
Port all conda recipes to rattler-build (#18054) @gforsyth
Minor improvements in arrow interop (#18053) @wence-
Pass more dtype objects to astype calls (#18044) @mroeschke
Forward merge branch-25.02 to branch-25.04 (#18041) @Matt711
Replace deprecated CCCL features (#18036) @miscco
Separate stats filtering helpers to reuse in page pruning (#18034) @mhaseeb123
Update spark-rapids-jni CI image version to cuda12.8.0 (#18024) @pxLi
Add pylibcudf.Scalar.from_numpy for bool/int/float/str types (#18020) @mroeschke
Support IntervalDtype(subtype=None) (#18017) @mroeschke
Enable pytest-xdist runs for py-polars tests (#18016) @galipremsagar
consolidate more conda solves in CI (#18014) @jameslamb
Replace cub::Int2Type with cuda::std::integral_constant (#18013) @miscco
Remove deprecated single component datetime extract APIs (#18010) @Matt711
Pass dtype objects to Column.astype (#18008) @mroeschke
Require CMake 3.30.4 (#18007) @robertmaynard
Refactor math_ops.cu dispatcher logic (#18006) @davidwendt
Move cudf::lists::detail::make_empty_lists_column to public API (#17996) @davidwendt
Create Conda CI test env in one step (#17995) @KyleFromNVIDIA
Add seed parameter to cudf hash_character_ngrams (#17994) @davidwendt
Remove deprecated rolling window functionality (#17993) @wence-
Continue on failures in cudf.pandas integration tests CI job (#17987) @Matt711
Avoid cudf.dtype calls in build_column/column_empty/.where (#17979) @mroeschke
Ensure dtype objects are passed within Column.astype (#17978) @mroeschke
Use Conda XGBoost (#17959) @jakirkham
Read the footers in parallel when reading multiple Parquet files (#17957) @vuule
Refactor predicate pushdown to reuse row group pruning in experimental PQ reader (#17946) @mhaseeb123
Add new nvtext tokenized minhash API (#17944) @davidwendt
Use shared-workflows branch-25.04 (#17943) @bdice
Get rid of the deprecated thrust::identity (#17942) @PointKernel
Remove deprecated nvtext::minhash_permuted APIs (#17939) @davidwendt
Enable third party library integration tests in CI with cudf.pandas (#17936) @galipremsagar
Add build_type input field for test.yaml (#17925) @gforsyth
Remove cudf.Scalar from shift/fillna (#17922) @mroeschke
Enabling cross join in cudf python (#17921) @galipremsagar
Use rapids-pip-retry in CI jobs that might need retries (#17920) @gforsyth
More avoid cudf.dtype internally in favor of pre-defined, supported types (#17918) @mroeschke
Initialize inout parameter (#17911) @miscco
Remove dataframe protocol (#17909) @vyasr
Rename PascalCase functions and types to snake_case to improve consistency (#17908) @vuule
Use new rapids-logger library (#17899) @vyasr
Add pylibcudf.Scalar.from_py for construction from Python strings, bool, int, float (#17898) @mroeschke
Remove cudf.Scalar from factorize (#17897) @mroeschke
disallow fallback to Make in Python builds (#17894) @jameslamb
Remove orc::gpu namespace (#17891) @vuule
Only run Auto Assign PR workflow if PR is not merged (#17888) @mroeschke
Update pre-commit-hooks to version 0.6.0 (#17887) @KyleFromNVIDIA
Forward-merge branch-25.02 to branch-25.04 (#17885) @bdice
Add script to run pylibcudf tests (#17882) @bdice
Migrate to NVKS for amd64 CI runners (#17877) @bdice
Fix merge conflict for branch-25.02 into branch-25.04 (#17874) @davidwendt
Remove decimal32/64 to decimal128 conversion in Parquet writer (#17869) @mhaseeb123
Expose JSON reader options to builder in pylibcudf (#17866) @shrshi
Remove cudf.Scalar from .dt timedelta properties (#17863) @mroeschke
Added support for custom types in PTX parser (#17861) @lamarrr
Remove cudf.Scalar from date_range/to_datetime (#17860) @mroeschke
Avoid cudf.dtype internally in favor of pre-defined, supported types (#17839) @mroeschke
Allow cudf::type_to_id<T const>() (#17831) @esoha-nvidia
Fixing auto-merge branch-25.02 into branch-25.04 (#17828) @davidwendt
Add new nvtext::normalize_characters API (#17818) @davidwendt
Include more information in error messages in the nvcomp adapter (#17814) @vuule
Extend and simplify API for calculation of range-based rolling window offsets (#17807) @wence-
More minor fixes for CCCL (#17793) @miscco
Use KvikIO to enable file's fast host read and host write (#17764) @kingcrimsontianyu
Remove cudf._lib.column in favor of pylibcudf. (#17760) @mroeschke
Replaced std::string with std::string_view and removed excessive copies in cudf::io (#17734) @lamarrr
Use xdist worksteal on the cudf.pandas test suite (#16930) @Matt711

- C++
Published by AyodeAwe about 1 year ago

https://github.com/rapidsai/cudf - [NIGHTLY] v25.06.00

🚨 Breaking Changes

Promote Parquet type enums to enum classes (#18441) @mhaseeb123
Move parquet schema types and structs to public headers (#18424) @mhaseeb123
Start removal of vector factories with _sync suffix by deprecating them and adding versions without the suffix (#18414) @vuule
Skip decoding of pages marked as pruned in PQ reader (#18347) @mhaseeb123
Deprecate nvtext subword tokenizer (#18334) @davidwendt
Add standard data ingestion pipelines to pylibcudf for ndarrays (#18311) @Matt711
Add Keep Option Parameter to Distinct (#18237) @warrickhe

🐛 Bug Fixes

Fix cpp examples cmake to use the rapids_config.cmake (#18501) @davidwendt
Rename rapidsmp to rapidsmpf (#18493) @rjzamora
Fix compilation with the C++20 standard (#18486) @vuule
Fix an error when reading some compressed Parquet V2 files (#18478) @vuule
Ensure DataFrame column label operations reset label_dtype (#18452) @mroeschke
Fix a segfault when reading a Parquet file with unsupported compression type (#18451) @vuule
Fix logger macros (#18444) @vyasr
Use delete not free to release data allocated with new (#18412) @wence-
Fix synchronization issues in host compression and decompression (#18395) @vuule
Update Dask array-conversion handling (#18382) @rjzamora
Fixed indexing on empty DataFrame with no columns (#18381) @TomAugspurger
Deterministic hashing for DataFrameScan nodes in cudf-polars multi-partition executor (#18351) @TomAugspurger
Fix index of right table in unary operators in AST, in Joins (#18333) @karthikeyann
Add offsetalator to contiguous-split (#18312) @davidwendt
Support large strings in nvtext vocabulary-tokenizer (#18283) @davidwendt

📖 Documentation

[DOC] Improve clarity in parquet APIs set_row_groups and set_columns parquet (#18466) @Matt711
Add a usage page to cudf-polars documentation (#18460) @Matt711
[DOC] Fix typo in CONTRIBUTING.md on build type tests (#18456) @JigaoLuo
Add restart kernel note in cudf pandas docs (#18374) @ncclementi

🚀 New Features

Support reading from device buffers in the pylibcudf IO APIs (#18496) @Matt711
Move parquet schema types and structs to public headers (#18424) @mhaseeb123
Add optional dtype argument to Scalar.from_any (#18415) @Matt711
Expose cudf::chunked_pack in pylibcudf (#18411) @wence-
Add support for long string columns in cudf::contiguous_split (#18393) @nvdbaranec
Automatically dispatch between host and device decompression/compression based on the number of buffers (#18363) @vuule
Skip decoding of pages marked as pruned in PQ reader (#18347) @mhaseeb123
Support constructing pylibcudf Columns and Tables from views into arbitrary objects (#18314) @vyasr
Add standard data ingestion pipelines to pylibcudf for ndarrays (#18311) @Matt711
Support cudf-polars isoyear and week (isoweek) (#18265) @brandon-b-miller
Add Keep Option Parameter to Distinct (#18237) @warrickhe
Add rapidsmp shuffle support to cudf-polars (#18231) @rjzamora
Support cudf-polars strftime (#18181) @brandon-b-miller
Support include_file_paths in cudf polars (#18057) @Matt711

🛠️ Improvements

Optimize pandas metadata generation to reduce memory pressure (#18505) @galipremsagar
Add pylibcudf.Column.from_rmm_buffer (#18502) @mroeschke
Replace thrust functors with libcu++ ones (#18500) @miscco
Rename cudf-polars executors (#18499) @rjzamora
Remove casting functions in pylibcudf utils (#18497) @Matt711
Increase wheel size limit. (#18487) @bdice
Split join header (#18484) @shrshi
Fix unspecified behavior involving move semantics and order of evaluation (#18481) @kingcrimsontianyu
Rerun flaky pytests in CI (#18476) @galipremsagar
Vendor RAPIDS.cmake (#18473) @bdice
Add ARM conda environments. (#18470) @bdice
Bump polars version to <1.28 (#18469) @Matt711
Promote Parquet type enums to enum classes (#18441) @mhaseeb123
Update compression formats supported in JSON reader (#18438) @shrshi
Disabled Jitify Minification (#18436) @lamarrr
Replace direct use of nvCOMP and of its adapter with the higher-level decompression API (#18434) @vuule
Test against stable tags for narwhals (#18431) @Matt711
Refcount-based dropping of cached evaluations in cudf-polars executor (#18430) @wence-
Replace Thrust iterator facilities with libcu++ ones (#18427) @miscco
Remove numpy requirement when converting 2d cuda array interface objects to pylibcudf Columns (#18426) @Matt711
Switch the ptr type in gpumemoryview from Py_ssize_t to uintptr_t (#18419) @Matt711
Add strings::extract_single API (#18417) @davidwendt
Start removal of vector factories with _sync suffix by deprecating them and adding versions without the suffix (#18414) @vuule
Allow polars arrow conversion to produce string_view (#18413) @wence-
Add rank and label_bin methods to ColumnBase (#18407) @mroeschke
Automatic single-partition fallback in cudf-polars (#18405) @rjzamora
Remove _sync suffix from hostdevice types (#18404) @vuule
Use owning Arrow types in C++ to expose data to Python (#18402) @vyasr
add static push and pop methods to NvtxRange (#18401) @zpuller
Deprecate cudf.Scalar (#18394) @mroeschke
Bump polars version to <1.27 (#18387) @Matt711
Branch 25.06 merge 25.04 (#18380) @Matt711
Silence warning by setting BUILD_SHARED_LIBS (#18371) @vyasr
Pass stream through when taking ownership from libcudf (#18367) @wence-
Avoid patching sort algorithms from CCCL (#18364) @miscco
Deprecate old nvtext::normalize_characters (#18360) @davidwendt
refactor(rattler): enable strict channel priority for builds (#18358) @gforsyth
Optimize sequences by introducing make_offsets_child_column (#18357) @ustcfy
Decompress all data in a single decompress_page_data when reading Parquet input in a single chunk (#18352) @vuule
Performance improvement for tolower/toupper for multi-byte UTF-8 characters (#18345) @davidwendt
Branch 25.06 merge branch 25.04 (#18344) @vyasr
Use dask-cuda for cudf-polars experimental testing (#18343) @rjzamora
Deprecate nvtext subword tokenizer (#18334) @davidwendt
Remove cudf.Scalar in as_column (#18331) @mroeschke
Allow cudf.DataFrame.from_pylibcudf to accept a pylibcudf.io.TableWithMetadata (#18319) @mroeschke
Avoid stateful construction in DataFrame.__init__ (#18306) @mroeschke
Improve the groupby performance for extremely low cardinality (#18290) @PointKernel
Require type annotations in cudf.polars (#18285) @TomAugspurger
Removing unnecessary StreamSynchronization in reading (#18279) @JigaoLuo
Use the mapped buffer for all read operations in the memory-mapped source; switch default source to the kvikIO one (#18204) @vuule
Improve test coverage in the catboost integration tests (#18126) @Matt711
Create file sources in parallel (#18094) @vuule

- C++
Published by rapids-bot[bot] about 1 year ago

https://github.com/rapidsai/cudf - v25.02.02

🚨 Breaking Changes

Expose stream-ordering in scalar and avro APIs (#17766) @shrshi
Add seed parameter to hash_character_ngrams (#17643) @davidwendt
Performance improvements and simplifications for fixed size row-based rolling windows (#17623) @wence-
Refactor distinct hash join to handle multiple probes with the same build table (#17609) @PointKernel
Deprecate cudf::grouped_time_range_rolling_window (#17589) @wence-
Remove "legacy" Dask DataFrame support from Dask cuDF (#17558) @rjzamora
Return empty result for segmented_reduce if input and offsets are both empty (#17437) @davidwendt
Rework minhash APIs for deprecation cycle (#17421) @davidwendt
Change indices for dictionary column to signed integer type (#17390) @davidwendt

🐛 Bug Fixes

Use protocol for dlpack instead of deprecated function (#18134) @vyasr
Skip the failing connectorx polars tests (#18037) @Matt711
Fix 'Unexpected short subpass' exception in parquet chunked reader. (#18019) @nvdbaranec
Fix race check failures in shared memory groupby (#17985) @PointKernel
Pin ibis version in the cudf.pandas integration tests <10.0.0 (#17975) @Matt711
Fix the index type in the indexing operator of the span types (#17971) @vuule
Add missing pin (#17915) @vyasr
Fix third-party cudf.pandas tests (#17900) @galipremsagar
Fix numpy data access by making attribute private (#17890) @galipremsagar
Remove extra local var declaration from cudf.pandas 3rd-party integration shell script (#17886) @Matt711
Move isinstance_cudf_pandas to fast_slow_proxy (#17875) @galipremsagar
Make _Series_dtype method a property (#17854) @Matt711
Fix the bug in determining the heuristics for shared memory groupby (#17851) @PointKernel
Fix possible OOB mem access in Parquet decoder (#17841) @mhaseeb123
Require batches to be non-empty in multi-batch JSON reader (#17837) @shrshi
Fix rolling(min_periods=) with int and null data with mode.pandas_compat (#17822) @mroeschke
Resolve race-condition in disable_module_accelerator (#17811) @galipremsagar
Make Series(dtype=object) raise in mode.pandas_compat with non string data (#17804) @mroeschke
Disable intended disabled ORC tests (#17790) @davidwendt
Fix empty DataFrame construction not returning RangeIndex columns (#17784) @mroeschke
Fix various .str methods for pandas compatibility (#17782) @mroeschke
Fix count API issue about ignoring nan values (#17779) @galipremsagar
Add numba pinning to cudf repo (#17777) @galipremsagar
Allow .sort_values(na_position=) to include NaNs in mode.pandas_compatible (#17776) @mroeschke
allow deselecting nvcomp wheels (#17774) @jameslamb
Use the aligned_resource_adaptor to allocate bloom filter device buffers (#17758) @mhaseeb123
Avoid instantiating bloom filter query function for nested and bool types (#17753) @mhaseeb123
Fix DataFrame.merge(Series, how="left"/"right") on column and index not resulting in a RangeIndex (#17739) @mroeschke
[BUG] xfail Polars excel test (#17731) @Matt711
Require to implement AutoCloseable for the classes derived from HostUDFWrapper (#17727) @ttnghia
Remove jlowe as a java committer since he retired (#17725) @tgravescs
Prevent use of invalid grid sizes in ORC reader and writer (#17709) @vuule
Enforce schema for partial tables in multi-source multi-batch JSON reader (#17708) @shrshi
Compute and use the initial string offset when building nested large string cols with chunked parquet reader (#17702) @mhaseeb123
Fix writing of compressed ORC files with large stripe footers (#17700) @vuule
Fix cudf.polars sum of empty not equalling zero (#17685) @mroeschke
Fix formatting in logging (#17680) @vuule
convert all nulls to nans in a specific scenario (#17677) @galipremsagar
Define cudf repr methods on the Column (#17675) @mroeschke
Fix groupby.len with null values in cudf.polars (#17671) @mroeschke
Fix: DataFrameGroupBy.get_group was raising with length>1 tuples (#17653) @MarcoGorelli
Fix possible int overflow in compute_mixed_join_output_size (#17633) @davidwendt
Fix a minor potential i32 overflow in thrust::transform_exclusive_scan in PQ reader preprocessing (#17617) @mhaseeb123
Fix failing xgboost test in the cudf.pandas third-party integration tests (#17616) @Matt711
Fix dask_cudf.read_csv (#17612) @rjzamora
Fix memcheck error in ReplaceTest.NormalizeNansAndZerosMutable gtest (#17610) @davidwendt
Correctly accept a pandas.CategoricalDtype(pandas.IntervalDtype(...), ...) type (#17604) @mroeschke
Add ability to modify and propagate names of columns object (#17597) @galipremsagar
Ignore NaN correctly in .quantile (#17593) @mroeschke
Fix groupby argmin/max gather of sorted-order indices (#17591) @davidwendt
Fix ctest fail running libcudf tests in a Debug build (#17576) @davidwendt
Specify a version for rapids_logger dependency (#17573) @jlowe
Fix the ORC decoding bug for the timestamp data (#17570) @kingcrimsontianyu
[JNI] remove rmm argument to set rw access for fabric handles (#17553) @abellina
Document undefined behavior in div_rounding_up_safe (#17542) @davidwendt
Fix nvcc-imposed UB in constexpr functions (#17534) @vuule
Add anonymous namespace to libcudf test source (#17529) @davidwendt
Propagate failures in pandas integration tests and Skip failing tests (#17521) @Matt711
Fix libcudf compile error when logging is disabled (#17512) @davidwendt
Fix Dask-cuDF clip APIs (#17509) @rjzamora
Fix pylibcudf to_arrow with multiple nested data types (#17504) @mroeschke
Fix groupby(as_index=False).size not resetting index (#17499) @mroeschke
Revert "Temporarily skip tests due to dask/distributed#8953" (#17492) @Matt711
Workaround for a misaligned access in read_csv on some CUDA versions (#17477) @vuule
Fix some possible thread-id overflow calculations (#17473) @davidwendt
Temporarily skip tests due to dask/distributed#8953 (#17472) @wence-
Detect mismatches in begin and end tokens returned by JSON tokenizer FST (#17471) @shrshi
Support dask>=2024.11.2 in Dask cuDF (#17439) @rjzamora
Fix write_json failure for zero columns in table/struct (#17414) @karthikeyann
Fix Debug-mode failing Arrow test (#17405) @zeroshade
Fix all null list column with missing child column in JSON reader (#17348) @karthikeyann

📖 Documentation

Fix forward merge 24.12->25.02 (#18002) @raydouglass
Fix incorrect example in pylibcudf docs (#17912) @Matt711
Explicitly call out that the GPU open beta runs on a single GPU (#17872) @taureandyernv
Update cudf.pandas colab link in docs (#17846) @taureandyernv
[DOC] Make pylibcudf docs more visible (#17803) @Matt711
Cross-link cudf.pandas profiler documentation. (#17668) @bdice
Document interpreter install command for cudf.pandas (#17358) @bdice
add comment to Series.tolist method (#17350) @tequilayu

🚀 New Features

Bump polars version to <1.22 (#17771) @Matt711
Make more constexpr available on device for cuIO (#17746) @PointKernel
Add public interop functions between pylibcudf and cudf classic (#17730) @Matt711
Support dask_expr migration into dask.dataframe (#17704) @rjzamora
Make tests build without relaxed constexpr (#17691) @PointKernel
Set default logger level to warn (#17684) @vyasr
Support multithreaded reading of compressed buffers in JSON reader (#17670) @shrshi
Control pinned memory use with environment variables (#17657) @vuule
Host compression (#17656) @vuule
Enable text build without relying on relaxed constexpr (#17647) @PointKernel
Implement HOST_UDF aggregation for reduction and segmented reduction (#17645) @ttnghia
Add JSON reader options structs to pylibcudf (#17614) @Matt711
Refactor distinct hash join to handle multiple probes with the same build table (#17609) @PointKernel
Add JSON Writer options classes to pylibcudf (#17606) @Matt711
Add ORC reader options structs to pylibcudf (#17601) @Matt711
Add Avro Reader options classes to pylibcudf (#17599) @Matt711
Enable binaryop build without relying on relaxed constexpr (#17598) @PointKernel
Measure the number of Parquet row groups filtered by predicate pushdown (#17594) @mhaseeb123
Implement HOST_UDF aggregation for groupby (#17592) @ttnghia
Plumb pylibcudf.io.parquet options classes through cudf python (#17506) @Matt711
Add partition-wise Select support to cuDF-Polars (#17495) @rjzamora
Add multi-partition Scan support to cuDF-Polars (#17494) @rjzamora
Migrate cudf::io::merge_row_group_metadata to pylibcudf (#17491) @Matt711
Add Parquet Reader options classes to pylibcudf (#17464) @Matt711
Add multi-partition DataFrameScan support to cuDF-Polars (#17441) @rjzamora
Return empty result for segmented_reduce if input and offsets are both empty (#17437) @davidwendt
Abstract polars function expression nodes to ensure they are serializable (#17418) @pentschev
Add CSV Reader options classes to pylibcudf (#17412) @Matt711
Add support for pylibcudf.DataType serialization (#17352) @pentschev
Enable rounding for Decimal32 and Decimal64 in cuDF (#17332) @a-hirota
Remove upper bounds on cuda-python to allow 12.6.2 and 11.8.5 (#17326) @bdice
Expose stream-ordering to groupby APIs (#17324) @shrshi
Migrate ORC Writer to pylibcudf (#17310) @Matt711
Support reading bloom filters from Parquet files and filter row groups using them (#17289) @mhaseeb123

🛠️ Improvements

Update to nvcomp 4.2.0.11 (#18042) @bdice
Remove pandas backend from cudf.pandas - ibis integration tests (#17945) @Matt711
Revert CUDA 12.8 shared workflow branch changes (#17879) @vyasr
Remove predicate param from DataFrameScan IR (#17852) @Matt711
Remove cudf.Scalar from scatter APIs (#17847) @mroeschke
Remove cudf.Scalar from interval_range (#17844) @mroeschke
Add verify-codeowners hook (#17840) @KyleFromNVIDIA
Build and test with CUDA 12.8.0 (#17834) @bdice
Increase timeout for recently added test (#17829) @galipremsagar
Apply ruff everywhere (notebooks and scripts) (#17820) @bdice
Fix pre-commit.ci failures (#17819) @bdice
Remove incorrect calls to set architectures (#17813) @vyasr
Fix typo in exception raised when attempting to convert a string column to cupy (#17800) @dagardner-nv
Add support for pyarrow-19 (#17794) @galipremsagar
increase parallelism in nightly builds (#17792) @jameslamb
Reduce libcudf memcheck tests output (#17791) @davidwendt
Make cudf build with latest CCCL (#17788) @miscco
Introduce some more rolling window benchmarks (#17787) @wence-
Add shellcheck to pre-commit and fix warnings (#17778) @gforsyth
Improve parquet reader very-long string performance (#17773) @pmattione-nvidia
Update how to manage host UDF instance (#17770) @res-life
Add getInts api for HostMemoryBuffer and UnsafeMemoryAccessor (#17767) @liurenjie1024
Expose stream-ordering in scalar and avro APIs (#17766) @shrshi
Standardize methods used from cudf.core._internals (#17765) @mroeschke
Implement string join in cudf-polars (#17755) @wence-
Deprecate dataframe protocol (#17736) @vyasr
Add parquet reader long row test (#17735) @pmattione-nvidia
Update kvikio call due to upstream changes (#17733) @kingcrimsontianyu
Delay setting MultiIndex.level/codes until needed (#17728) @mroeschke
Bounding pool size in multi-batch JSON reader (#17724) @shrshi
Use GCC 13 in CUDA 12 conda builds. (#17721) @bdice
Update minimal sphinx theme version so that we can use parallel doc builds (#17719) @vyasr
Add more aggregation methods in pylibcudf (#17717) @mroeschke
Make cudf._lib.string_udf work with pylibcudf Columns instead of cudf._lib Columns (#17715) @mroeschke
Add special orc test data: timestamp interspersed with null values (#17713) @kingcrimsontianyu
Add pylibcudf.null_mask.null_count (#17711) @mroeschke
Ensure pyarrow.Scalar to pylibcudf.Scalar is cached (#17707) @mroeschke
Adapt cudf numba config for numba 0.61 removal (#17705) @mroeschke
Remove cudf._lib.scalar in favor of pylibcudf (#17701) @mroeschke
Fix parquet reader list bug (#17699) @pmattione-nvidia
Migrated Dynamic AST Expression Trees in Benchmarks and Tests to use AST Tree (#17697) @lamarrr
Skip polars test that can generate timezones that chrono_tz doesn't know (#17694) @wence-
Use 64-bit offsets only if the current strings column output chunk size exceeds threshold (#17693) @mhaseeb123
Use latest ci-conda images (#17690) @bdice
Add multi-source reading to JSON reader benchmarks (#17688) @shrshi
Convert cudf.Scalar usage to pylibcudf and pyarrow usage (#17686) @mroeschke
remove find_package(Python) in libcudf build (#17683) @jameslamb
Fix build metrics report format with long placeholder filenames (#17679) @davidwendt
Use rapids-cmake for the logger (#17674) @vyasr
Java Parquet reads via multiple host buffers (#17673) @jlowe
Remove cudf._libs.types.pyx (#17665) @mroeschke
Add support for Groupby.cumprod (#17661) @galipremsagar
Implement .dt.total_seconds (#17659) @galipremsagar
Avoid shallow copies in groupby methods (#17646) @mroeschke
Avoid double MultiIndex factorization in groupby index result (#17644) @mroeschke
Add seed parameter to hash_character_ngrams (#17643) @davidwendt
Fix possible overflow in WriteCoalescingCallbackWrapper::TearDown (#17642) @davidwendt
Remove pragma GCC diagnostic from source files (#17637) @davidwendt
Move unnecessary utilities from cudf._lib.scalar (#17636) @mroeschke
Support compression= in DataFrame.to_json (#17634) @mroeschke
Bump Polars version to <1.18 (#17632) @Matt711
Add public APIs to Access Underlying cudf and pandas Objects from cudf.pandas Proxy Objects (#17629) @galipremsagar
Use Numba Config to turn on Pynvjitlink Features (#17628) @isVoid
Use PyNVML 12 (#17627) @jakirkham
Remove cudf._lib.utils in favor of python APIs (#17625) @mroeschke
Performance improvements and simplifications for fixed size row-based rolling windows (#17623) @wence-
Fix return types for MurmurHash3_x86_32 template specializations (#17622) @davidwendt
Clean up namespaces and improve compression-related headers (#17621) @vuule
Use more pylibcudf.types instead of cudf._lib.types (#17619) @mroeschke
Remove patch that is only needed for clang-tidy to run on test files (#17618) @vyasr
update telemetry actions to fluent-bit friendly style (#17615) @msarahan
Introduce some simple benchmarks for rolling window aggregations (#17613) @wence-
Bump the oldest pyarrow version to 14.0.2 in test matrix (#17611) @galipremsagar
Use [[nodiscard]] attribute before __device__ (#17608) @vuule
Use host_vector in flatten_single_pass_aggs (#17605) @vuule
Stop memory_resource.hpp from including itself (#17603) @vyasr
Replace the outdated cuco window concept with buckets (#17602) @PointKernel
Check if nightlies have succeeded recently enough (#17596) @vyasr
Deprecate cudf::grouped_time_range_rolling_window (#17589) @wence-
A couple of fixes in rapids-logger usage (#17588) @vyasr
Simplify expression transformer in Parquet predicate pushdown with ast::tree (#17587) @mhaseeb123
Remove unused functionality in cudf._lib.utils.pyx (#17586) @mroeschke
Use cuda-python cuda.bindings import names. (#17585) @bdice
Use no-sync copy for fixed-width types in cudf::concatenate (#17584) @davidwendt
Remove cudf._lib.groupby in favor of inlining pylibcudf (#17582) @mroeschke
Remove unused code of json schema in JSON reader (#17581) @karthikeyann
Expose Scalar's constructor and Scalar#getScalarHandle() to public (#17580) @ttnghia
Allow large strings in nvtext benchmarks (#17579) @davidwendt
Remove cudf._lib.reduce in favor of inlining pylibcudf (#17574) @mroeschke
Use batched memcpy when writing ORC statistics (#17572) @vuule
Allow large strings in nvbench strings benchmarks (#17571) @davidwendt
Update version references in workflow (#17568) @AyodeAwe
Enable all json reader options in pylibcudf read_json (#17563) @karthikeyann
Remove cudf._lib.parquet in favor of inlining pylibcudf (#17562) @mroeschke
Fix CMake format in cudf/_lib/CMakeLists.txt (#17559) @mroeschke
Remove "legacy" Dask DataFrame support from Dask cuDF (#17558) @rjzamora
Replace direct cudaMemcpyAsync calls with utility functions (within /include) (#17557) @vuule
Remove cudf._lib.interop in favor of inlining pylibcudf (#17555) @mroeschke
gate telemetry dispatch calls on TELEMETRY_ENABLED env var (#17551) @msarahan
Replace direct cudaMemcpyAsync calls with utility functions (within /src) (#17550) @vuule
Remove unused BufferArrayFromVector (#17549) @Matt711
Move cudf._lib.copying to cudf.core._internals (#17548) @mroeschke
Update cuda-python lower bounds to 12.6.2 / 11.8.5 (#17547) @bdice
Fix typos, rename types, and add null_probability benchmark axis for distinct (#17546) @PointKernel
Mark more constexpr functions as device-available (#17545) @vyasr
Use cooperative-groups instead of cub warp-reduce for strings contains (#17540) @davidwendt
Remove cudf._lib.nvtext in favor of inlining pylibcudf (#17535) @mroeschke
Add XXHash_32 hasher (#17533) @PointKernel
Remove unused masked keyword in column_empty (#17530) @mroeschke
Remove Thrust patch in favor of CMake definition for Thrust 32-bit offset types. (#17527) @bdice
[JNI] Enables fabric handles for CUDA async memory pools (#17526) @abellina
Force Thrust to use 32-bit offset type. (#17523) @bdice
Replace cudf::detail::copy_if logic with thrust::copy_if and gather (#17520) @davidwendt
Replaces uses of cudf._lib.Column.from_unique_ptr with pylibcudf.Column.from_libcudf (#17517) @Matt711
Move cudf._lib.aggregation to cudf.core._internals (#17516) @mroeschke
Migrate copy_column and Column.from_scalar to pylibcudf (#17513) @Matt711
Remove cudf._lib.transform in favor of inlining pylibcudf (#17505) @mroeschke
Remove cudf._lib.string.convert/split in favor of inlining pylibcudf (#17496) @mroeschke
Move cudf._lib.sort to cudf.core._internals (#17488) @mroeschke
Remove cudf._lib.csv in favor of inlining pylibcudf (#17485) @mroeschke
Update PyTorch to >=2.4.0 to get fix for CUDA array interface bug, and drop CUDA 11 PyTorch tests. (#17475) @bdice
Remove cudf._lib.binops in favor of inlining pylibcudf (#17468) @mroeschke
Remove cudf._lib.orc in favor of inlining pylibcudf (#17466) @mroeschke
skip most CI on devcontainer-only changes (#17465) @jameslamb
Set build type for all examples (#17463) @vyasr
Update the hook versions in pre-commit (#17462) @wence-
Remove cudf._lib.string_casting in favor of inlining pylibcudf (#17460) @mroeschke
Remove cudf._lib.filling in favor of inlining pylibcudf (#17459) @mroeschke
Update MurmurHash3_x64_128 to use the cuco equivalent implementation (#17457) @PointKernel
Move cudf._lib.stream_compaction to cudf.core._internals (#17456) @mroeschke
Clean up xxhash_64 implementations (#17455) @PointKernel
Update Hadoop dependency in Java pom (#17454) @jlowe
Adapt to rmm logger changes (#17451) @vyasr
Require approval to run CI on draft PRs (#17450) @bdice
Expose stream-ordering in nvtext API (#17446) @shrshi
Use exec_policy_nosync in write_json (#17445) @karthikeyann
Remove cudf._lib.json in favor of inlining pylibcudf (#17443) @mroeschke
Remove cudf._lib.null_mask in favor of inlining pylibcudf (#17440) @mroeschke
Expose stream-ordering in replace API (#17436) @shrshi
Expose stream-ordering in copying APIs (#17435) @shrshi
Expose stream-ordering in column view APIs (#17434) @shrshi
Apply clang-tidy autofixes from new rules (#17431) @vyasr
Remove cudf._lib.round in favor of inlining pylibcudf (#17430) @mroeschke
Update MurmurHash3_x86_32 to use the cuco equivalent implementation (#17429) @PointKernel
Remove cudf._lib.replace in favor of inlining pylibcudf (#17428) @mroeschke
Remove nvtx/ranges.hpp include from cuda.cuh (#17427) @davidwendt
Remove the unused detail int_fastdiv.h header (#17426) @PointKernel
Remove cudf._lib.lists in favor of inlining pylibcudf (#17425) @mroeschke
Remove cudf._lib.quantile (#17424) @mroeschke
Remove cudf._lib.rolling in favor of inlining pylibcudf (#17423) @mroeschke
Avoid converting Decimal32/Decimal64 in to_arrow and from_arrow APIs (#17422) @zeroshade
Rework minhash APIs for deprecation cycle (#17421) @davidwendt
Use thread_index_type in binary-ops jit kernel.cu (#17420) @davidwendt
Change binops for-each kernel to thrust::for_each_n (#17419) @davidwendt
Move cudf._lib.search to cudf.core._internals (#17411) @mroeschke
Use grid_1d utilities in copy_range.cuh (#17409) @davidwendt
Remove cudf._lib.text in favor of inlining pylibcudf (#17408) @mroeschke
Run clang-tidy checks in PR CI (#17407) @bdice
Update strings/text source to use grid_1d for thread/block/stride calculations (#17404) @davidwendt
Expose stream-ordering to strings attribute APIs (#17398) @shrshi
Expose stream-ordering to interop APIs (#17397) @shrshi
Remove unused type aliases (#17396) @PointKernel
Remove some cudf._lib.strings files in favor of inlining pylibcudf (#17394) @mroeschke
Update xxhash_64 to utilize the cuco equivalent implementation (#17393) @PointKernel
Change indices for dictionary column to signed integer type (#17390) @davidwendt
Return categorical values in to_numpy/to_cupy (#17388) @mroeschke
Forward-merge branch-24.12 to branch-25.02 (#17379) @bdice
Remove unused IO utilities from cudf python (#17374) @Matt711
Remove cudf._lib.datetime in favor of inlining pylibcudf (#17372) @mroeschke
Remove cudf._lib.join in favor of inlining pylibcudf (#17371) @mroeschke
Remove cudf._lib.merge in favor of inlining pylibcudf (#17370) @mroeschke
Remove cudf._lib.partitioning in favor of inlining pylibcudf (#17369) @mroeschke
Remove cudf._lib.reshape in favor of inlining pylibcudf (#17368) @mroeschke
Remove cudf._lib.timezone in favor of inlining pylibcudf (#17366) @mroeschke
Remove cudf._lib.transpose in favor of inlining pylibcudf (#17365) @mroeschke
Move make_strings_column benchmark to nvbench (#17340) @davidwendt
Improve strings contains/find performance for smaller strings (#17330) @davidwendt
Use rapids-logger to generate the cudf logger (#17307) @vyasr
Mukernels strings (#17286) @pmattione-nvidia
Add write_parquet to pylibcudf (#17263) @mroeschke
Single-partition Dask executor for cuDF-Polars (#17262) @rjzamora
Add breaking change workflow trigger (#17248) @AyodeAwe
Precompute AST arity (#17234) @bdice
Update to CCCL 2.7.0-rc2. (#17233) @bdice
Make column_empty mask buffer creation consistent with libcudf (#16715) @mroeschke

- C++
Published by raydouglass about 1 year ago

https://github.com/rapidsai/cudf - v25.02.01

🚨 Breaking Changes

Expose stream-ordering in scalar and avro APIs (#17766) @shrshi
Add seed parameter to hash_character_ngrams (#17643) @davidwendt
Performance improvements and simplifications for fixed size row-based rolling windows (#17623) @wence-
Refactor distinct hash join to handle multiple probes with the same build table (#17609) @PointKernel
Deprecate cudf::grouped_time_range_rolling_window (#17589) @wence-
Remove "legacy" Dask DataFrame support from Dask cuDF (#17558) @rjzamora
Return empty result for segmented_reduce if input and offsets are both empty (#17437) @davidwendt
Rework minhash APIs for deprecation cycle (#17421) @davidwendt
Change indices for dictionary column to signed integer type (#17390) @davidwendt

🐛 Bug Fixes

Skip the failing connectorx polars tests (#18037) @Matt711
Fix 'Unexpected short subpass' exception in parquet chunked reader. (#18019) @nvdbaranec
Fix race check failures in shared memory groupby (#17985) @PointKernel
Pin ibis version in the cudf.pandas integration tests <10.0.0 (#17975) @Matt711
Fix the index type in the indexing operator of the span types (#17971) @vuule
Add missing pin (#17915) @vyasr
Fix third-party cudf.pandas tests (#17900) @galipremsagar
Fix numpy data access by making attribute private (#17890) @galipremsagar
Remove extra local var declaration from cudf.pandas 3rd-party integration shell script (#17886) @Matt711
Move isinstance_cudf_pandas to fast_slow_proxy (#17875) @galipremsagar
Make _Series_dtype method a property (#17854) @Matt711
Fix the bug in determining the heuristics for shared memory groupby (#17851) @PointKernel
Fix possible OOB mem access in Parquet decoder (#17841) @mhaseeb123
Require batches to be non-empty in multi-batch JSON reader (#17837) @shrshi
Fix rolling(min_periods=) with int and null data with mode.pandas_compat (#17822) @mroeschke
Resolve race-condition in disable_module_accelerator (#17811) @galipremsagar
Make Series(dtype=object) raise in mode.pandas_compat with non string data (#17804) @mroeschke
Disable intended disabled ORC tests (#17790) @davidwendt
Fix empty DataFrame construction not returning RangeIndex columns (#17784) @mroeschke
Fix various .str methods for pandas compatibility (#17782) @mroeschke
Fix count API issue about ignoring nan values (#17779) @galipremsagar
Add numba pinning to cudf repo (#17777) @galipremsagar
Allow .sort_values(na_position=) to include NaNs in mode.pandas_compatible (#17776) @mroeschke
allow deselecting nvcomp wheels (#17774) @jameslamb
Use the aligned_resource_adaptor to allocate bloom filter device buffers (#17758) @mhaseeb123
Avoid instantiating bloom filter query function for nested and bool types (#17753) @mhaseeb123
Fix DataFrame.merge(Series, how="left"/"right") on column and index not resulting in a RangeIndex (#17739) @mroeschke
[BUG] xfail Polars excel test (#17731) @Matt711
Require to implement AutoCloseable for the classes derived from HostUDFWrapper (#17727) @ttnghia
Remove jlowe as a java committer since he retired (#17725) @tgravescs
Prevent use of invalid grid sizes in ORC reader and writer (#17709) @vuule
Enforce schema for partial tables in multi-source multi-batch JSON reader (#17708) @shrshi
Compute and use the initial string offset when building nested large string cols with chunked parquet reader (#17702) @mhaseeb123
Fix writing of compressed ORC files with large stripe footers (#17700) @vuule
Fix cudf.polars sum of empty not equalling zero (#17685) @mroeschke
Fix formatting in logging (#17680) @vuule
convert all nulls to nans in a specific scenario (#17677) @galipremsagar
Define cudf repr methods on the Column (#17675) @mroeschke
Fix groupby.len with null values in cudf.polars (#17671) @mroeschke
Fix: DataFrameGroupBy.get_group was raising with length>1 tuples (#17653) @MarcoGorelli
Fix possible int overflow in compute_mixed_join_output_size (#17633) @davidwendt
Fix a minor potential i32 overflow in thrust::transform_exclusive_scan in PQ reader preprocessing (#17617) @mhaseeb123
Fix failing xgboost test in the cudf.pandas third-party integration tests (#17616) @Matt711
Fix dask_cudf.read_csv (#17612) @rjzamora
Fix memcheck error in ReplaceTest.NormalizeNansAndZerosMutable gtest (#17610) @davidwendt
Correctly accept a pandas.CategoricalDtype(pandas.IntervalDtype(...), ...) type (#17604) @mroeschke
Add ability to modify and propagate names of columns object (#17597) @galipremsagar
Ignore NaN correctly in .quantile (#17593) @mroeschke
Fix groupby argmin/max gather of sorted-order indices (#17591) @davidwendt
Fix ctest fail running libcudf tests in a Debug build (#17576) @davidwendt
Specify a version for rapids_logger dependency (#17573) @jlowe
Fix the ORC decoding bug for the timestamp data (#17570) @kingcrimsontianyu
[JNI] remove rmm argument to set rw access for fabric handles (#17553) @abellina
Document undefined behavior in div_rounding_up_safe (#17542) @davidwendt
Fix nvcc-imposed UB in constexpr functions (#17534) @vuule
Add anonymous namespace to libcudf test source (#17529) @davidwendt
Propagate failures in pandas integration tests and Skip failing tests (#17521) @Matt711
Fix libcudf compile error when logging is disabled (#17512) @davidwendt
Fix Dask-cuDF clip APIs (#17509) @rjzamora
Fix pylibcudf to_arrow with multiple nested data types (#17504) @mroeschke
Fix groupby(as_index=False).size not resetting index (#17499) @mroeschke
Revert "Temporarily skip tests due to dask/distributed#8953" (#17492) @Matt711
Workaround for a misaligned access in read_csv on some CUDA versions (#17477) @vuule
Fix some possible thread-id overflow calculations (#17473) @davidwendt
Temporarily skip tests due to dask/distributed#8953 (#17472) @wence-
Detect mismatches in begin and end tokens returned by JSON tokenizer FST (#17471) @shrshi
Support dask>=2024.11.2 in Dask cuDF (#17439) @rjzamora
Fix write_json failure for zero columns in table/struct (#17414) @karthikeyann
Fix Debug-mode failing Arrow test (#17405) @zeroshade
Fix all null list column with missing child column in JSON reader (#17348) @karthikeyann

📖 Documentation

Fix forward merge 24.12->25.02 (#18002) @raydouglass
Fix incorrect example in pylibcudf docs (#17912) @Matt711
Explicitly call out that the GPU open beta runs on a single GPU (#17872) @taureandyernv
Update cudf.pandas colab link in docs (#17846) @taureandyernv
[DOC] Make pylibcudf docs more visible (#17803) @Matt711
Cross-link cudf.pandas profiler documentation. (#17668) @bdice
Document interpreter install command for cudf.pandas (#17358) @bdice
add comment to Series.tolist method (#17350) @tequilayu

🚀 New Features

Bump polars version to <1.22 (#17771) @Matt711
Make more constexpr available on device for cuIO (#17746) @PointKernel
Add public interop functions between pylibcudf and cudf classic (#17730) @Matt711
Support dask_expr migration into dask.dataframe (#17704) @rjzamora
Make tests build without relaxed constexpr (#17691) @PointKernel
Set default logger level to warn (#17684) @vyasr
Support multithreaded reading of compressed buffers in JSON reader (#17670) @shrshi
Control pinned memory use with environment variables (#17657) @vuule
Host compression (#17656) @vuule
Enable text build without relying on relaxed constexpr (#17647) @PointKernel
Implement HOST_UDF aggregation for reduction and segmented reduction (#17645) @ttnghia
Add JSON reader options structs to pylibcudf (#17614) @Matt711
Refactor distinct hash join to handle multiple probes with the same build table (#17609) @PointKernel
Add JSON Writer options classes to pylibcudf (#17606) @Matt711
Add ORC reader options structs to pylibcudf (#17601) @Matt711
Add Avro Reader options classes to pylibcudf (#17599) @Matt711
Enable binaryop build without relying on relaxed constexpr (#17598) @PointKernel
Measure the number of Parquet row groups filtered by predicate pushdown (#17594) @mhaseeb123
Implement HOST_UDF aggregation for groupby (#17592) @ttnghia
Plumb pylibcudf.io.parquet options classes through cudf python (#17506) @Matt711
Add partition-wise Select support to cuDF-Polars (#17495) @rjzamora
Add multi-partition Scan support to cuDF-Polars (#17494) @rjzamora
Migrate cudf::io::merge_row_group_metadata to pylibcudf (#17491) @Matt711
Add Parquet Reader options classes to pylibcudf (#17464) @Matt711
Add multi-partition DataFrameScan support to cuDF-Polars (#17441) @rjzamora
Return empty result for segmented_reduce if input and offsets are both empty (#17437) @davidwendt
Abstract polars function expression nodes to ensure they are serializable (#17418) @pentschev
Add CSV Reader options classes to pylibcudf (#17412) @Matt711
Add support for pylibcudf.DataType serialization (#17352) @pentschev
Enable rounding for Decimal32 and Decimal64 in cuDF (#17332) @a-hirota
Remove upper bounds on cuda-python to allow 12.6.2 and 11.8.5 (#17326) @bdice
Expose stream-ordering to groupby APIs (#17324) @shrshi
Migrate ORC Writer to pylibcudf (#17310) @Matt711
Support reading bloom filters from Parquet files and filter row groups using them (#17289) @mhaseeb123

🛠️ Improvements

Update to nvcomp 4.2.0.11 (#18042) @bdice
Remove pandas backend from cudf.pandas - ibis integration tests (#17945) @Matt711
Revert CUDA 12.8 shared workflow branch changes (#17879) @vyasr
Remove predicate param from DataFrameScan IR (#17852) @Matt711
Remove cudf.Scalar from scatter APIs (#17847) @mroeschke
Remove cudf.Scalar from interval_range (#17844) @mroeschke
Add verify-codeowners hook (#17840) @KyleFromNVIDIA
Build and test with CUDA 12.8.0 (#17834) @bdice
Increase timeout for recently added test (#17829) @galipremsagar
Apply ruff everywhere (notebooks and scripts) (#17820) @bdice
Fix pre-commit.ci failures (#17819) @bdice
Remove incorrect calls to set architectures (#17813) @vyasr
Fix typo in exception raised when attempting to convert a string column to cupy (#17800) @dagardner-nv
Add support for pyarrow-19 (#17794) @galipremsagar
increase parallelism in nightly builds (#17792) @jameslamb
Reduce libcudf memcheck tests output (#17791) @davidwendt
Make cudf build with latest CCCL (#17788) @miscco
Introduce some more rolling window benchmarks (#17787) @wence-
Add shellcheck to pre-commit and fix warnings (#17778) @gforsyth
Improve parquet reader very-long string performance (#17773) @pmattione-nvidia
Update how to manage host UDF instance (#17770) @res-life
Add getInts api for HostMemoryBuffer and UnsafeMemoryAccessor (#17767) @liurenjie1024
Expose stream-ordering in scalar and avro APIs (#17766) @shrshi
Standardize methods used from cudf.core._internals (#17765) @mroeschke
Implement string join in cudf-polars (#17755) @wence-
Deprecate dataframe protocol (#17736) @vyasr
Add parquet reader long row test (#17735) @pmattione-nvidia
Update kvikio call due to upstream changes (#17733) @kingcrimsontianyu
Delay setting MultiIndex.level/codes until needed (#17728) @mroeschke
Bounding pool size in multi-batch JSON reader (#17724) @shrshi
Use GCC 13 in CUDA 12 conda builds. (#17721) @bdice
Update minimal sphinx theme version so that we can use parallel doc builds (#17719) @vyasr
Add more aggregation methods in pylibcudf (#17717) @mroeschke
Make cudf._lib.string_udf work with pylibcudf Columns instead of cudf._lib Columns (#17715) @mroeschke
Add special orc test data: timestamp interspersed with null values (#17713) @kingcrimsontianyu
Add pylibcudf.null_mask.null_count (#17711) @mroeschke
Ensure pyarrow.Scalar to pylibcudf.Scalar is cached (#17707) @mroeschke
Adapt cudf numba config for numba 0.61 removal (#17705) @mroeschke
Remove cudf._lib.scalar in favor of pylibcudf (#17701) @mroeschke
Fix parquet reader list bug (#17699) @pmattione-nvidia
Migrated Dynamic AST Expression Trees in Benchmarks and Tests to use AST Tree (#17697) @lamarrr
Skip polars test that can generate timezones that chrono_tz doesn't know (#17694) @wence-
Use 64-bit offsets only if the current strings column output chunk size exceeds threshold (#17693) @mhaseeb123
Use latest ci-conda images (#17690) @bdice
Add multi-source reading to JSON reader benchmarks (#17688) @shrshi
Convert cudf.Scalar usage to pylibcudf and pyarrow usage (#17686) @mroeschke
remove find_package(Python) in libcudf build (#17683) @jameslamb
Fix build metrics report format with long placehold filenames (#17679) @davidwendt
Use rapids-cmake for the logger (#17674) @vyasr
Java Parquet reads via multiple host buffers (#17673) @jlowe
Remove cudf._libs.types.pyx (#17665) @mroeschke
Add support for Groupby.cumprod (#17661) @galipremsagar
Implement .dt.total_seconds (#17659) @galipremsagar
Avoid shallow copies in groupby methods (#17646) @mroeschke
Avoid double MultiIndex factorization in groupby index result (#17644) @mroeschke
Add seed parameter to hash_character_ngrams (#17643) @davidwendt
Fix possible overflow in WriteCoalescingCallbackWrapper::TearDown (#17642) @davidwendt
Remove pragma GCC diagnostic from source files (#17637) @davidwendt
Move unnecessary utilities from cudf._lib.scalar (#17636) @mroeschke
Support compression= in DataFrame.to_json (#17634) @mroeschke
Bump Polars version to <1.18 (#17632) @Matt711
Add public APIs to Access Underlying cudf and pandas Objects from cudf.pandas Proxy Objects (#17629) @galipremsagar
Use Numba Config to turn on Pynvjitlink Features (#17628) @isVoid
Use PyNVML 12 (#17627) @jakirkham
Remove cudf._lib.utils in favor of python APIs (#17625) @mroeschke
Performance improvements and simplifications for fixed size row-based rolling windows (#17623) @wence-
Fix return types for MurmurHash3x8632 template specializations (#17622) @davidwendt
Clean up namespaces and improve compression-related headers (#17621) @vuule
Use more pylibcudf.types instead of cudf._lib.types (#17619) @mroeschke
Remove patch that is only needed for clang-tidy to run on test files (#17618) @vyasr
update telemetry actions to fluent-bit friendly style (#17615) @msarahan
Introduce some simple benchmarks for rolling window aggregations (#17613) @wence-
Bump the oldest pyarrow version to 14.0.2 in test matrix (#17611) @galipremsagar
Use [[nodiscard]] attribute before __device__ (#17608) @vuule
Use host_vector in flatten_single_pass_aggs (#17605) @vuule
Stop memory_resource.hpp from including itself (#17603) @vyasr
Replace the outdated cuco window concept with buckets (#17602) @PointKernel
Check if nightlies have succeeded recently enough (#17596) @vyasr
Deprecate cudf::grouped_time_range_rolling_window (#17589) @wence-
A couple of fixes in rapids-logger usage (#17588) @vyasr
Simplify expression transformer in Parquet predicate pushdown with ast::tree (#17587) @mhaseeb123
Remove unused functionality in cudf._lib.utils.pyx (#17586) @mroeschke
Use cuda-python cuda.bindings import names. (#17585) @bdice
Use no-sync copy for fixed-width types in cudf::concatenate (#17584) @davidwendt
Remove cudf._lib.groupby in favor of inlining pylibcudf (#17582) @mroeschke
Remove unused code of json schema in JSON reader (#17581) @karthikeyann
Expose Scalar's constructor and Scalar#getScalarHandle() to public (#17580) @ttnghia
Allow large strings in nvtext benchmarks (#17579) @davidwendt
Remove cudf._lib.reduce in favor of inlining pylibcudf (#17574) @mroeschke
Use batched memcpy when writing ORC statistics (#17572) @vuule
Allow large strings in nvbench strings benchmarks (#17571) @davidwendt
Update version references in workflow (#17568) @AyodeAwe
Enable all json reader options in pylibcudf read_json (#17563) @karthikeyann
Remove cudf._lib.parquet in favor of inlining pylibcudf (#17562) @mroeschke
Fix CMake format in cudf/_lib/CMakeLists.txt (#17559) @mroeschke
Remove "legacy" Dask DataFrame support from Dask cuDF (#17558) @rjzamora
Replace direct cudaMemcpyAsync calls with utility functions (within /include) (#17557) @vuule
Remove cudf._lib.interop in favor of inlining pylibcudf (#17555) @mroeschke
gate telemetry dispatch calls on TELEMETRY_ENABLED env var (#17551) @msarahan
Replace direct cudaMemcpyAsync calls with utility functions (within /src) (#17550) @vuule
Remove unused BufferArrayFromVector (#17549) @Matt711
Move cudf._lib.copying to cudf.core._internals (#17548) @mroeschke
Update cuda-python lower bounds to 12.6.2 / 11.8.5 (#17547) @bdice
Fix typos, rename types, and add null_probability benchmark axis for distinct (#17546) @PointKernel
Mark more constexpr functions as device-available (#17545) @vyasr
Use cooperative-groups instead of cub warp-reduce for strings contains (#17540) @davidwendt
Remove cudf._lib.nvtext in favor of inlining pylibcudf (#17535) @mroeschke
Add XXHash_32 hasher (#17533) @PointKernel
Remove unused masked keyword in column_empty (#17530) @mroeschke
Remove Thrust patch in favor of CMake definition for Thrust 32-bit offset types. (#17527) @bdice
[JNI] Enables fabric handles for CUDA async memory pools (#17526) @abellina
Force Thrust to use 32-bit offset type. (#17523) @bdice
Replace cudf::detail::copy_if logic with thrust::copy_if and gather (#17520) @davidwendt
Replaces uses of cudf._lib.Column.from_unique_ptr with pylibcudf.Column.from_libcudf (#17517) @Matt711
Move cudf._lib.aggregation to cudf.core._internals (#17516) @mroeschke
Migrate copy_column and Column.from_scalar to pylibcudf (#17513) @Matt711
Remove cudf._lib.transform in favor of inlining pylibcudf (#17505) @mroeschke
Remove cudf._lib.string.convert/split in favor of inlining pylibcudf (#17496) @mroeschke
Move cudf._lib.sort to cudf.core._internals (#17488) @mroeschke
Remove cudf._lib.csv in favor of inlining pylibcudf (#17485) @mroeschke
Update PyTorch to >=2.4.0 to get fix for CUDA array interface bug, and drop CUDA 11 PyTorch tests. (#17475) @bdice
Remove cudf._lib.binops in favor of inlining pylibcudf (#17468) @mroeschke
Remove cudf._lib.orc in favor of inlining pylibcudf (#17466) @mroeschke
skip most CI on devcontainer-only changes (#17465) @jameslamb
Set build type for all examples (#17463) @vyasr
Update the hook versions in pre-commit (#17462) @wence-
Remove cudf._lib.string_casting in favor of inlining pylibcudf (#17460) @mroeschke
Remove cudf._lib.filling in favor of inlining pylibcudf (#17459) @mroeschke
Update MurmurHash3_x64_128 to use the cuco equivalent implementation (#17457) @PointKernel
Move cudf._lib.stream_compaction to cudf.core._internals (#17456) @mroeschke
Clean up xxhash_64 implementations (#17455) @PointKernel
Update Hadoop dependency in Java pom (#17454) @jlowe
Adapt to rmm logger changes (#17451) @vyasr
Require approval to run CI on draft PRs (#17450) @bdice
Expose stream-ordering in nvtext API (#17446) @shrshi
Use exec_policy_nosync in write_json (#17445) @karthikeyann
Remove cudf._lib.json in favor of inlining pylibcudf (#17443) @mroeschke
Remove cudf._lib.null_mask in favor of inlining pylibcudf (#17440) @mroeschke
Expose stream-ordering in replace API (#17436) @shrshi
Expose stream-ordering in copying APIs (#17435) @shrshi
Expose stream-ordering in column view APIs (#17434) @shrshi
Apply clang-tidy autofixes from new rules (#17431) @vyasr
Remove cudf._lib.round in favor of inlining pylibcudf (#17430) @mroeschke
Update MurmurHash3_x86_32 to use the cuco equivalent implementation (#17429) @PointKernel
Remove cudf._lib.replace in favor of inlining pylibcudf (#17428) @mroeschke
Remove nvtx/ranges.hpp include from cuda.cuh (#17427) @davidwendt
Remove the unused detail int_fastdiv.h header (#17426) @PointKernel
Remove cudf._lib.lists in favor of inlining pylibcudf (#17425) @mroeschke
Remove cudf._lib.quantile (#17424) @mroeschke
Remove cudf._lib.rolling in favor of inlining pylibcudf (#17423) @mroeschke
Avoid converting Decimal32/Decimal64 in to_arrow and from_arrow APIs (#17422) @zeroshade
Rework minhash APIs for deprecation cycle (#17421) @davidwendt
Use thread_index_type in binary-ops jit kernel.cu (#17420) @davidwendt
Change binops for-each kernel to thrust::for_each_n (#17419) @davidwendt
Move cudf._lib.search to cudf.core._internals (#17411) @mroeschke
Use grid_1d utilities in copy_range.cuh (#17409) @davidwendt
Remove cudf._lib.text in favor of inlining pylibcudf (#17408) @mroeschke
Run clang-tidy checks in PR CI (#17407) @bdice
Update strings/text source to use grid_1d for thread/block/stride calculations (#17404) @davidwendt
Expose stream-ordering to strings attribute APIs (#17398) @shrshi
Expose stream-ordering to interop APIs (#17397) @shrshi
Remove unused type aliases (#17396) @PointKernel
Remove some cudf._lib.strings files in favor of inlining pylibcudf (#17394) @mroeschke
Update xxhash_64 to utilize the cuco equivalent implementation (#17393) @PointKernel
Change indices for dictionary column to signed integer type (#17390) @davidwendt
Return categorical values in to_numpy/to_cupy (#17388) @mroeschke
Forward-merge branch-24.12 to branch-25.02 (#17379) @bdice
Remove unused IO utilities from cudf python (#17374) @Matt711
Remove cudf._lib.datetime in favor of inlining pylibcudf (#17372) @mroeschke
Remove cudf._lib.join in favor of inlining pylibcudf (#17371) @mroeschke
Remove cudf._lib.merge in favor of inlining pylibcudf (#17370) @mroeschke
Remove cudf._lib.partitioning in favor of inlining pylibcudf (#17369) @mroeschke
Remove cudf._lib.reshape in favor of inlining pylibcudf (#17368) @mroeschke
Remove cudf._lib.timezone in favor of inlining pylibcudf (#17366) @mroeschke
Remove cudf._lib.transpose in favor of inlining pylibcudf (#17365) @mroeschke
Move make_strings_column benchmark to nvbench (#17340) @davidwendt
Improve strings contains/find performance for smaller strings (#17330) @davidwendt
Use rapids-logger to generate the cudf logger (#17307) @vyasr
Mukernels strings (#17286) @pmattione-nvidia
Add write_parquet to pylibcudf (#17263) @mroeschke
Single-partition Dask executor for cuDF-Polars (#17262) @rjzamora
Add breaking change workflow trigger (#17248) @AyodeAwe
Precompute AST arity (#17234) @bdice
Update to CCCL 2.7.0-rc2. (#17233) @bdice
Make column_empty mask buffer creation consistent with libcudf (#16715) @mroeschke

- C++
Published by AyodeAwe over 1 year ago

https://github.com/rapidsai/cudf - v24.12.00

🚨 Breaking Changes

Fix reading Parquet string cols when nrows and input_pass_limit > 0 (#17321) @mhaseeb123
prefer wheel-provided libcudf.so in load_library(), use RTLD_LOCAL (#17316) @jameslamb
Deprecate single component extraction methods in libcudf (#17221) @Matt711
Move detail header floating_conversion.hpp to detail subdirectory (#17209) @davidwendt
Refactor Dask cuDF legacy code (#17205) @rjzamora
Make HostMemoryBuffer call into the DefaultHostMemoryAllocator (#17204) @revans2
Remove java reservation (#17189) @revans2
Separate evaluation logic from IR objects in cudf-polars (#17175) @rjzamora
Upgrade to polars 1.11 in cudf-polars (#17154) @wence-
Remove the additional host register calls initially intended for performance improvement on Grace Hopper (#17092) @kingcrimsontianyu
Correctly set is_device_accessible when creating host_spans from other container/span types (#17079) @vuule
Unify treatment of Expr and IR nodes in cudf-polars DSL (#17016) @wence-
Deprecate support for directly accessing logger (#16964) @vyasr
Made cudftestutil header-only and removed GTest dependency (#16839) @lamarrr

🐛 Bug Fixes

Turn off cudf.pandas 3rd party integrations tests for 24.12 (#17500) @Matt711
Ignore errors when testing glibc versions (#17389) @vyasr
Adapt to KvikIO API change in the compatibility mode (#17377) @kingcrimsontianyu
Support pivot with index or column arguments as lists (#17373) @mroeschke
Deselect failing polars tests (#17362) @pentschev
Fix integer overflow in compiled binaryop (#17354) @wence-
Update cmake to 3.28.6 in JNI Dockerfile (#17342) @jlowe
fix library-loading issues in editable installs (#17338) @jameslamb
Bug fix: restrict lines=True to JSON format in Kafka read_gdf method (#17333) @a-hirota
Fix various issues with replace API and add support in datetime and timedelta columns (#17331) @galipremsagar
Do not exclude nanoarrow and flatbuffers from installation if statically linked (#17322) @hyperbolic2346
Fix reading Parquet string cols when nrows and input_pass_limit > 0 (#17321) @mhaseeb123
Remove another reference to FindcuFile (#17315) @KyleFromNVIDIA
Fix reading of single-row unterminated CSV files (#17305) @vuule
Fixed lifetime issue in ast transform tests (#17292) @lamarrr
Switch to using TaskSpec (#17285) @galipremsagar
Fix data_type ctor call in JSON_TEST (#17273) @davidwendt
Expose delimiter character in JSON reader options to JSON reader APIs (#17266) @shrshi
Fix extract-datetime deprecation warning in ndsh benchmark (#17254) @davidwendt
Disallow cuda-python 12.6.1 and 11.8.4 (#17253) @bdice
Wrap custom iterator result (#17251) @galipremsagar
Fix binop with LHS numpy datetimelike scalar (#17226) @mroeschke
Fix Dataframe.__setitem__ slow-downs (#17222) @galipremsagar
Fix groupby.get_group with length-1 tuple with list-like grouper (#17216) @mroeschke
Fix discoverability of submodules inside pd.util (#17215) @galipremsagar
Fix Schema.Builder does not propagate precision value to Builder instance (#17214) @ttnghia
Mark column chunks in a PQ reader pass as large strings when the cumulative offsets exceeds the large strings threshold. (#17207) @mhaseeb123
[BUG] Replace repo_token with github_token in Auto Assign PR GHA (#17203) @Matt711
Remove unsanitized nulls from input strings columns in reduction gtests (#17202) @davidwendt
Fix to_parquet append behavior with global metadata file (#17198) @rjzamora
Check num_children() == 0 in Column.from_column_view (#17193) @cwharris
Fix host-to-device copy missing sync in strings/duration convert (#17149) @davidwendt
Add JNI Support for Multi-line Delimiters and Include Test (#17139) @SurajAralihalli
Ignore loud dask warnings about legacy dataframe implementation (#17137) @galipremsagar
Fix the GDS read/write segfault/bus error when the cuFile policy is set to GDS or ALWAYS (#17122) @kingcrimsontianyu
Fix DataFrame._from_arrays and introduce validations (#17112) @galipremsagar
[Bug] Fix Arrow-FS parquet reader for larger files (#17099) @rjzamora
Fix bug in recovering invalid lines in JSONL inputs (#17098) @shrshi
Reenable huge pages for arrow host copying (#17097) @vyasr
Correctly set is_device_accessible when creating host_spans from other container/span types (#17079) @vuule
Fix ORC reader when using device_read_async while the destination device buffers are not ready (#17074) @ttnghia
Fix regex handling of fixed quantifier with 0 range (#17067) @davidwendt
Limit the number of keys to calculate column sizes and page starts in PQ reader to 1B (#17059) @mhaseeb123
Adding assertion to check for regular JSON inputs of size greater than INT_MAX bytes (#17057) @shrshi
bug fix: use self.ck_consumer in poll method of kafka.py to align with __init__ (#17044) @a-hirota
Disable kvikio remote I/O to avoid openssl dependencies in JNI build (#17026) @pxLi
Fix host_span constructor to correctly copy is_device_accessible (#17020) @vuule
Add pinning for pyarrow in wheels (#17018) @vyasr
Use std::optional for host types (#17015) @robertmaynard
Fix write_json to handle empty string column (#16995) @karthikeyann
Restore export of nvcomp outside of wheel builds (#16988) @KyleFromNVIDIA
Allow melt(var_name=) to be a falsy label (#16981) @mroeschke
Fix astype from tz-aware type to tz-aware type (#16980) @mroeschke
Use libcudf wheel from PR rather than nightly for polars-polars CI test job (#16975) @brandon-b-miller
Fix order-preservation in pandas-compat unsorted groupby (#16942) @wence-
Fix cudf::strings::findall error with empty input (#16928) @davidwendt
Fix JsonLargeReaderTest.MultiBatch use of LIBCUDF_JSON_BATCH_SIZE env var (#16927) @davidwendt
Parse newline as whitespace character while tokenizing JSONL inputs with non-newline delimiter (#16923) @shrshi
Respect groupby.nunique(dropna=False) (#16921) @mroeschke
Update all rmm imports to use pylibrmm/librmm (#16913) @Matt711
Fix order-preservation in cudf-polars groupby (#16907) @wence-
Add a shortcut for when the input clusters are all empty for the tdigest merge (#16897) @jihoonson
Properly handle the mapped and registered regions in memory_mapped_source (#16865) @vuule
Fix performance regression for generate_character_ngrams (#16849) @davidwendt
Fix regex parsing logic handling of nested quantifiers (#16798) @davidwendt
Compute whole column variance using numerically stable approach (#16448) @wence-

📖 Documentation

Add documentation for low memory readers (#17314) @btepera
Fix the example in documentation for get_dremel_data() (#17242) @mhaseeb123
Fix some documentation rendering for pylibcudf (#17217) @mroeschke
Move detail header floating_conversion.hpp to detail subdirectory (#17209) @davidwendt
Add TokenizeVocabulary to api docs (#17208) @davidwendt
Add jaccard_index to generated cuDF docs (#17199) @davidwendt
[no ci] Add empty-columns section to the libcudf developer guide (#17183) @davidwendt
Add 2-cpp approvers text to contributing guide no ci @davidwendt
Changing developer guide int_64_t to int64_t (#17130) @hyperbolic2346
docs: change 'CSV' to 'csv' in python/custreamz/README.md to match kafka.py (#17041) @a-hirota
[DOC] Document limitation using cudf.pandas proxy arrays (#16955) @Matt711
[DOC] Document environment variable for failing on fallback in cudf.pandas (#16932) @Matt711

🚀 New Features

Add version config (#17312) @vyasr
Java JNI for Multiple contains (#17281) @res-life
Add cudf::calendrical_month_sequence to pylibcudf (#17277) @Matt711
Raise errors on specific types of fallback in cudf.pandas (#17268) @Matt711
Add catboost to the third-party integration tests (#17267) @Matt711
Add type stubs for pylibcudf (#17258) @wence-
Use pylibcudf contiguous split APIs in cudf python (#17246) @Matt711
Upgrade nvcomp to 4.1.0.6 (#17201) @bdice
Added Arrow Interop Benchmarks (#17194) @lamarrr
Rewrite Java API Table.readJSON to return the output from libcudf read_json directly (#17180) @ttnghia
Support storing precision of decimal types in Schema class (#17176) @ttnghia
Migrate CSV writer to pylibcudf (#17163) @Matt711
Add compute_shared_memory_aggs used by shared memory groupby (#17162) @PointKernel
Added ast tree to simplify expression lifetime management (#17156) @lamarrr
Add compute_mapping_indices used by shared memory groupby (#17147) @PointKernel
Add remaining datetime APIs to pylibcudf (#17143) @Matt711
Added strings AST vs BINARY_OP benchmarks (#17128) @lamarrr
Use libcudf_exception_handler throughout pylibcudf.libcudf (#17109) @brandon-b-miller
Include timezone file path in error message (#17102) @bdice
Migrate NVText Byte Pair Encoding APIs to pylibcudf (#17101) @Matt711
Migrate NVText Tokenizing APIs to pylibcudf (#17100) @Matt711
Migrate NVtext subword tokenizing APIs to pylibcudf (#17096) @Matt711
Migrate NVText Stemming APIs to pylibcudf (#17085) @Matt711
Migrate NVText Replacing APIs to pylibcudf (#17084) @Matt711
Add IWYU to CI (#17078) @vyasr
cudf-polars string/numeric casting (#17076) @brandon-b-miller
Migrate NVText Normalizing APIs to Pylibcudf (#17072) @Matt711
Migrate remaining nvtext NGrams APIs to pylibcudf (#17070) @Matt711
Add profilers to CUDA 12 conda devcontainers (#17066) @vyasr
Add conda recipe for cudf-polars (#17037) @bdice
Implement batch construction for strings columns (#17035) @ttnghia
Add device aggregators used by shared memory groupby (#17031) @PointKernel
Add optional column_order in JSON reader (#17029) @karthikeyann
Migrate Min Hashing APIs to pylibcudf (#17021) @Matt711
Reorganize cudf_polars expression code (#17014) @brandon-b-miller
Migrate nvtext jaccard API to pylibcudf (#17007) @Matt711
Migrate nvtext generate_ngrams APIs to pylibcudf (#17006) @Matt711
Control whether a file data source memory-maps the file with an environment variable (#17004) @vuule
Switched BINARY_OP Benchmarks from GoogleBench to NVBench (#16963) @lamarrr
[FEA] Report all unsupported operations for a query in cudf.polars (#16960) @Matt711
[FEA] Migrate nvtext/edit_distance APIs to pylibcudf (#16957) @Matt711
Switched AST benchmarks from GoogleBench to NVBench (#16952) @lamarrr
Extend device_scalar to optionally use pinned bounce buffer (#16947) @vuule
Implement cudf-polars chunked parquet reading (#16944) @brandon-b-miller
Expose streams in public round APIs (#16925) @Matt711
add telemetry setup to test (#16924) @msarahan
Add cudf::strings::contains_multiple (#16900) @davidwendt
Made cudftestutil header-only and removed GTest dependency (#16839) @lamarrr
Add an example to demonstrate multithreaded read_parquet pipelines (#16828) @mhaseeb123
Implement extract_datetime_component in libcudf/pylibcudf (#16776) @brandon-b-miller
Add cudf::strings::find_re API (#16742) @davidwendt
Migrate hashing operations to pylibcudf (#15418) @brandon-b-miller

🛠️ Improvements

Simplify serialization protocols (#17552) @vyasr
Add pynvml as a dependency for dask-cudf (#17386) @pentschev
Enable unified memory by default in cudf_polars (#17375) @galipremsagar
Support polars 1.14 (#17355) @wence-
Remove cudf._lib.quantiles in favor of inlining pylibcudf (#17347) @mroeschke
Remove cudf._lib.labeling in favor of inlining pylibcudf (#17346) @mroeschke
Remove cudf._lib.hash in favor of inlining pylibcudf (#17345) @mroeschke
Remove cudf._lib.concat in favor of inlining pylibcudf (#17344) @mroeschke
Extract GPUEngine config options at translation time (#17339) @rjzamora
Update java datetime APIs to match CUDF. (#17329) @revans2
Move strings url_decode benchmarks to nvbench (#17328) @davidwendt
Move strings translate benchmarks to nvbench (#17325) @davidwendt
Writing compressed output using JSON writer (#17323) @shrshi
Test the full matrix for polars and dask wheels on nightlies (#17320) @vyasr
Remove cudf._lib.avro in favor of inlining pylibcudf (#17319) @mroeschke
Move cudf._lib.unary to cudf.core._internals (#17318) @mroeschke
prefer wheel-provided libcudf.so in load_library(), use RTLD_LOCAL (#17316) @jameslamb
Clean up misc, unneeded pylibcudf.libcudf in cudf._lib (#17309) @mroeschke
Exclude nanoarrow and flatbuffers from installation (#17308) @vyasr
Update CI jobs to include Polars in nightlies and improve IWYU (#17306) @vyasr
Move strings repeat benchmarks to nvbench (#17304) @davidwendt
Fix synchronization bug in bool parquet mukernels (#17302) @pmattione-nvidia
Move strings replace benchmarks to nvbench (#17301) @davidwendt
Support polars 1.13 (#17299) @wence-
Replace FindcuFile with upstream FindCUDAToolkit support (#17298) @KyleFromNVIDIA
Expose stream-ordering in public transpose API (#17294) @shrshi
Replace workaround of JNI build with CUDF_KVIKIO_REMOTE_IO=OFF (#17293) @pxLi
cmake option: CUDF_KVIKIO_REMOTE_IO (#17291) @madsbk
Use more pylibcudf Python enums in cudf._lib (#17288) @mroeschke
Use pylibcudf enums in cudf Python quantile (#17287) @mroeschke
enforce wheel size limits, README formatting in CI (#17284) @jameslamb
Use numba-cuda<0.0.18 (#17280) @gmarkall
Add compute_column_expression to pylibcudf for transform.compute_column (#17279) @mroeschke
Optimize distinct inner join to use set find instead of retrieve (#17278) @PointKernel
remove WheelHelpers.cmake (#17276) @jameslamb
Plumb pylibcudf datetime APIs through cudf python (#17275) @Matt711
Follow up making Python tests more deterministic (#17272) @mroeschke
Use pylibcudf.search APIs in cudf python (#17271) @Matt711
Use pylibcudf.strings.convert.convert_integers.is_integer in cudf python (#17270) @Matt711
Move strings filter benchmarks to nvbench (#17269) @davidwendt
Make constructor of DeviceMemoryBufferView public (#17265) @liurenjie1024
Put a ceiling on cuda-python (#17264) @jameslamb
Always prefer device_reads and device_writes when kvikIO is enabled (#17260) @vuule
Expose streams in public quantile APIs (#17257) @shrshi
Add support for pyarrow-18 (#17256) @galipremsagar
Move strings/numeric convert benchmarks to nvbench (#17255) @davidwendt
Add new dask_cudf.read_parquet API (#17250) @rjzamora
Add read_parquet_metadata to pylibcudf (#17245) @mroeschke
Search for kvikio with lowercase (#17243) @vyasr
KvikIO shared library (#17239) @madsbk
Use more pylibcudf.io.types enums in cudf._libs (#17237) @mroeschke
Expose mixed and conditional joins in pylibcudf (#17235) @wence-
Add io.text APIs to pylibcudf (#17232) @mroeschke
Add num_iterations axis to the multi-threaded Parquet benchmarks (#17231) @vuule
Move strings to date/time types benchmarks to nvbench (#17229) @davidwendt
Support for polars 1.12 in cudf-polars (#17227) @wence-
Allow generating large strings in benchmarks (#17224) @davidwendt
Refactor gather/scatter benchmarks for strings (#17223) @davidwendt
Deprecate single component extraction methods in libcudf (#17221) @Matt711
Remove nvtext::load_vocabulary from pylibcudf (#17220) @Matt711
Benchmarking JSON reader for compressed inputs (#17219) @shrshi
Expose stream-ordering in partitioning API (#17213) @shrshi
Move strings::concatenate benchmark to nvbench (#17211) @davidwendt
Expose stream-ordering in subword tokenizer API (#17206) @shrshi
Refactor Dask cuDF legacy code (#17205) @rjzamora
Make HostMemoryBuffer call into the DefaultHostMemoryAllocator (#17204) @revans2
Unified binary_ops and ast benchmarks parameter names (#17200) @lamarrr
Add in new java API for raw host memory allocation (#17197) @revans2
Remove java reservation (#17189) @revans2
Fixed unused attribute compilation error for GCC 13 (#17188) @lamarrr
Change default KvikIO parameters in cuDF: set the thread pool size to 4, and compatibility mode to ON (#17185) @kingcrimsontianyu
Use make_device_uvector instead of cudaMemcpyAsync in inplace_bitmask_binop (#17181) @davidwendt
Make ai.rapids.cudf.HostMemoryBuffer#copyFromStream public. (#17179) @liurenjie1024
Separate evaluation logic from IR objects in cudf-polars (#17175) @rjzamora
Move nvtext ngrams benchmarks to nvbench (#17173) @davidwendt
Remove includes suggested by include-what-you-use (#17170) @vyasr
Reading multi-source compressed JSONL files (#17161) @shrshi
Process parquet bools with microkernels (#17157) @pmattione-nvidia
Upgrade to polars 1.11 in cudf-polars (#17154) @wence-
Deprecate current libcudf nvtext minhash functions (#17152) @davidwendt
Remove unused variable in internal merge_tdigests utility (#17151) @davidwendt
Use the full ref name of rmm.DeviceBuffer in the sphinx config file (#17150) @Matt711
Move segmented_gather function from the copying module to the lists module (#17148) @Matt711
Use async execution policy for true_if (#17146) @PointKernel
Add conversion from cudf-polars expressions to libcudf ast for parquet filters (#17141) @wence-
devcontainer: replace VAULT_HOST with AWS_ROLE_ARN (#17134) @jjacobelli
Replace direct cudaMemcpyAsync calls with utility functions (limited to cudf::io) (#17132) @vuule
use rapids-generate-pip-constraints to pin to oldest dependencies in CI (#17131) @jameslamb
Set the default number of threads in KvikIO thread pool to 8 (#17126) @kingcrimsontianyu
Fix clang-tidy violations for span.hpp and hostdevice_vector.hpp (#17124) @davidwendt
Disable the Parquet reader's wide lists tables GTest by default (#17120) @mhaseeb123
Add compile time check to ensure the counting_iterator type in counting_transform_iterator fits in size_type (#17118) @mhaseeb123
Minor I/O code quality improvements (#17105) @kingcrimsontianyu
Remove the additional host register calls initially intended for performance improvement on Grace Hopper (#17092) @kingcrimsontianyu
Split hash-based groupby into multiple smaller files to reduce build time (#17089) @PointKernel
build wheels without build isolation (#17088) @jameslamb
Polars: DataFrame Serialization (#17062) @madsbk
Remove unused hash helper functions (#17056) @PointKernel
Add to_dlpack/from_dlpack APIs to pylibcudf (#17055) @mroeschke
Move flatten_single_pass_aggs to its own TU (#17053) @PointKernel
Replace deprecated cuco APIs with updated versions (#17052) @PointKernel
Refactor ORC dictionary encoding to migrate to the new cuco::static_map (#17049) @mhaseeb123
Move pylibcudf/libcudf/wrappers/decimals to pylibcudf/libcudf/fixed_point (#17048) @mroeschke
make conda installs in CI stricter (part 2) (#17042) @jameslamb
Use managed memory for NDSH benchmarks (#17039) @karthikeyann
Clean up hash-groupby var_hash_functor (#17034) @PointKernel
Add json APIs to pylibcudf (#17025) @mroeschke
Add string.replace_re APIs to pylibcudf (#17023) @mroeschke
Replace old host tree algorithm with new algorithm in JSON reader (#17019) @karthikeyann
Unify treatment of Expr and IR nodes in cudf-polars DSL (#17016) @wence-
make conda installs in CI stricter (#17013) @jameslamb
Pylibcudf: pack and unpack (#17012) @madsbk
Remove unneeded pylibcudf.libcudf.wrappers.duration usage in cudf (#17010) @mroeschke
Add custom "fused" groupby aggregation to Dask cuDF (#17009) @rjzamora
Make tests more deterministic (#17008) @galipremsagar
Remove unused import (#17005) @Matt711
Add string.convert.convert_urls APIs to pylibcudf (#17003) @mroeschke
Add release tracking to project automation scripts (#17001) @jarmak-nv
Implement inequality joins by translation to conditional joins (#17000) @wence-
Add string.convert.convert_lists APIs to pylibcudf (#16997) @mroeschke
Performance optimization of JSON validation (#16996) @karthikeyann
Add string.convert.convert_ipv4 APIs to pylibcudf (#16994) @mroeschke
Add string.convert.convert_integers APIs to pylibcudf (#16991) @mroeschke
Add string.convert_floats APIs to pylibcudf (#16990) @mroeschke
Add string.convert.convert_fixed_type APIs to pylibcudf (#16984) @mroeschke
Remove unnecessary std::move's in pylibcudf (#16983) @Matt711
Add docstrings and test for strings.convert_durations APIs for pylibcudf (#16982) @mroeschke
JSON tokenizer memory optimizations (#16978) @shrshi
Turn on xfail_strict = true for all python packages (#16977) @wence-
Add string.convert.convert_datetime/convert_booleans APIs to pylibcudf (#16971) @mroeschke
Auto assign PR to author (#16969) @Matt711
Deprecate support for directly accessing logger (#16964) @vyasr
Expunge NamedColumn (#16962) @wence-
Add clang-tidy to CI (#16958) @vyasr
Address all remaining clang-tidy errors (#16956) @vyasr
Apply clang-tidy autofixes (#16949) @vyasr
Use nvcomp wheel instead of bundling nvcomp (#16946) @KyleFromNVIDIA
Refactor the cuda_memcpy functions to make them more usable (#16945) @vuule
Add string.split APIs to pylibcudf (#16940) @mroeschke
clang-tidy fixes part 3 (#16939) @vyasr
clang-tidy fixes part 2 (#16938) @vyasr
clang-tidy fixes part 1 (#16937) @vyasr
Add string.wrap APIs to pylibcudf (#16935) @mroeschke
Add string.translate APIs to pylibcudf (#16934) @mroeschke
Add string.find_multiple APIs to pylibcudf (#16920) @mroeschke
Batch memcpy the last offsets for output buffers of str and list cols in PQ reader (#16905) @mhaseeb123
reduce wheel build verbosity, narrow deprecation warning filter (#16896) @jameslamb
Improve aggregation device functors (#16884) @PointKernel
Upgrade pandas pinnings to support 2.2.3 (#16882) @galipremsagar
Fix 24.10 to 24.12 forward merge (#16876) @bdice
Manually resolve conflicts in between branch-24.12 and branch-24.10 (#16871) @galipremsagar
Add in support for setting delim when parsing JSON through java (#16867) @revans2
Reapply mixed_semi_join refactoring and bug fixes (#16859) @mhaseeb123
Add string padding and side_type APIs to pylibcudf (#16833) @mroeschke
Organize parquet reader mukernel non-nullable code, introduce manual block scans (#16830) @pmattione-nvidia
Remove superfluous use of std::vector for std::future (#16829) @kingcrimsontianyu
Rework read_csv IO to avoid reading whole input with a single host_read (#16826) @vuule
Add strings.combine APIs to pylibcudf (#16790) @mroeschke
Add remaining string.char_types APIs to pylibcudf (#16788) @mroeschke
Add new nvtext minhash_permuted API (#16756) @davidwendt
Avoid public constructors when called with columns to avoid unnecessary validation (#16747) @mroeschke
Use changed-files shared workflow (#16713) @KyleFromNVIDIA
lint: replace isort with Ruff's rule I (#16685) @Borda
Improve the performance of low cardinality groupby (#16619) @PointKernel
Parquet reader list microkernel (#16538) @pmattione-nvidia
AWS S3 IO through KvikIO (#16499) @madsbk
Refactor histogram reduction using cuco::static_set::insert_and_find (#16485) @srinivasyadav18
Use numba-cuda>=0.0.13 (#16474) @gmarkall

- C++
Published by GPUtester over 1 year ago

https://github.com/rapidsai/cudf - v24.10.01

This hotfix corrected some python packaging issues.

Full Changelog: https://github.com/rapidsai/cudf/compare/v24.10.00...v24.10.01

- C++
Published by raydouglass over 1 year ago

https://github.com/rapidsai/cudf - v24.10.00

🚨 Breaking Changes

Whitespace normalization of nested column coerced as string column in JSONL inputs (#16759) @shrshi
Add libcudf wrappers around current_device_resource functions. (#16679) @harrism
Fix empty cluster handling in tdigest merge (#16675) @jihoonson
Remove java ColumnView.copyWithBooleanColumnAsValidity (#16660) @revans2
Support reading multiple PQ sources with mismatching nullability for columns (#16639) @mhaseeb123
Remove arrow_io_source (#16607) @vyasr
Remove legacy Arrow interop APIs (#16590) @vyasr
Remove NativeFile support from cudf Python (#16589) @vyasr
Revert "Make proxy NumPy arrays pass isinstance check in cudf.pandas" (#16586) @Matt711
Align public utility function signatures with pandas 2.x (#16565) @mroeschke
Disallow cudf.Index accepting column in favor of ._from_column (#16549) @mroeschke
Refactor dictionary encoding in PQ writer to migrate to the new cuco::static_map (#16541) @mhaseeb123
Change IPv4 convert APIs to support UINT32 instead of INT64 (#16489) @davidwendt
enable list to be forced as string in JSON reader. (#16472) @karthikeyann
Disallow cudf.Series to accept column in favor of ._from_column (#16454) @mroeschke
Align groupby APIs with pandas 2.x (#16403) @mroeschke
Align misc DataFrame and MultiIndex methods with pandas 2.x (#16402) @mroeschke
Align Index APIs with pandas 2.x (#16361) @mroeschke
Add stream param to stream compaction APIs (#16295) @JayjeetAtGithub

🐛 Bug Fixes

Add license to the pylibcudf wheel (#16976) @raydouglass
Parse newline as whitespace character while tokenizing JSONL inputs with non-newline delimiter (#16950) @shrshi
Add dask-cudf workaround for missing rename_axis support in cudf (#16899) @rjzamora
Update oldest deps for pyarrow & numpy (#16883) @galipremsagar
Update labeler for pylibcudf (#16868) @vyasr
Revert "Refactor mixed_semi_join using cuco::static_set" (#16855) @mhaseeb123
Fix metadata after implicit array conversion from Dask cuDF (#16842) @rjzamora
Add cudf.pandas dependencies.yaml to update-version.sh (#16840) @raydouglass
Use cupy 12.2.0 as oldest dependency pinning on CUDA 12 ARM (#16808) @bdice
Revert "Fix empty cluster handling in tdigest merge (#16675)" (#16800) @jihoonson
Intentionally leak thread_local CUDA resources to avoid crash (part 1) (#16787) @kingcrimsontianyu
Fix cov/corr bug in dask-cudf (#16786) @rjzamora
Fix slice_strings wide strings logic with multi-byte characters (#16777) @davidwendt
Fix nvbench output for sha512 (#16773) @davidwendt
Allow read_csv(header=None) to return int column labels in `mode.pandas_compatible` (#16769) @mroeschke
Whitespace normalization of nested column coerced as string column in JSONL inputs (#16759) @shrshi
Fix DataFrame.drop(columns=cudf.Series/Index, axis=1) (#16712) @mroeschke
Use merge base when calculating changed files (#16709) @KyleFromNVIDIA
Ensure we pass the has_nulls tparam to mixed_join kernels (#16708) @abellina
Add boost-devel to Java CI Docker image (#16707) @jlowe
[BUG] Add gpu node type to cudf-pandas 3rd-party integration nightly CI job (#16704) @Matt711
Fix typo in column_factories.hpp comment from 'depth 1' to 'depth 2' (#16700) @a-hirota
Fix Series.to_frame(name=None) setting a None name (#16698) @mroeschke
Disable gtests/ERROR_TEST during compute-sanitizer memcheck test (#16691) @davidwendt
Enable batched multi-source reading of JSONL files with large records (#16687) @shrshi
Handle ordered parameter in CategoricalIndex.__repr__ (#16683) @galipremsagar
Fix loc/iloc.__setitem__[:, loc] with non cupy types (#16677) @mroeschke
Fix empty cluster handling in tdigest merge (#16675) @jihoonson
Fix cudf::rank not getting enough params (#16666) @JayjeetAtGithub
Fix slowdown in CategoricalIndex.__repr__ (#16665) @galipremsagar
Remove java ColumnView.copyWithBooleanColumnAsValidity (#16660) @revans2
Fix slowdown in DataFrame repr in jupyter notebook (#16656) @galipremsagar
Preserve Series name in duplicated method. (#16655) @bdice
Fix interval_range right child non-zero offset (#16651) @mroeschke
fix libcudf wheel publishing, make package-type explicit in wheel publishing (#16650) @jameslamb
Revert "Hide all gtest symbols in cudftestutil (#16546)" (#16644) @robertmaynard
Fix integer overflow in indexalator pointer logic (#16643) @davidwendt
Allow for binops between two differently sized DecimalDtypes (#16638) @mroeschke
Move pragma once in rolling/jit/operation.hpp. (#16636) @bdice
Fix overflow bug in low-memory JSON reader (#16632) @shrshi
Add the missing num_aggregations axis for groupby_max_cardinality (#16630) @PointKernel
Fix strings::detail::copy_range when target contains nulls (#16626) @davidwendt
Fix function parameters with common dependency modified during their evaluation (#16620) @ttnghia
bug-fix: Don't enable the CUDA language if testing was requested when finding cudf (#16615) @cryos
bug-fix: cudf/io/json.hpp use after move (#16609) @NicolasDenoyelle
Remove CUDA whole compilation ODR violations (#16603) @robertmaynard
MAINT: Adapt to numpy hiding flagsobject away (#16593) @seberg
Revert "Make proxy NumPy arrays pass isinstance check in cudf.pandas" (#16586) @Matt711
Switch python version to 3.10 in cudf.pandas pandas test scripts (#16559) @galipremsagar
Hide all gtest symbols in cudftestutil (#16546) @robertmaynard
Update the java code to properly deal with lists being returned as strings (#16536) @revans2
Register read_parquet and read_csv with dask-expr (#16535) @rjzamora
Change cudf::empty_like to not include offsets for empty strings columns (#16529) @davidwendt
Fix DataFrame reductions with median returning scalar instead of Series (#16527) @mroeschke
Allow DataFrame.sort_values(by=) to select an index level (#16519) @mroeschke
Fix date_range(start, end, freq) when end-start is divisible by freq (#16516) @mroeschke
Preserve array name in MultiIndex.from_arrays (#16515) @mroeschke
Disallow indexing by selecting duplicate labels (#16514) @mroeschke
Fix .replace(Index, Index) raising a TypeError (#16513) @mroeschke
Check index bounds in compact protocol reader. (#16493) @bdice
Fix build failures with GCC 13 (#16488) @PointKernel
Fix all-empty input column for strings split APIs (#16466) @davidwendt
Fix segmented-sort overlapped input/output indices (#16463) @davidwendt
Fix merge conflict for auto merge 16447 (#16449) @davidwendt

📖 Documentation

Fix links in Dask cuDF documentation (#16929) @rjzamora
Improve aggregation documentation (#16822) @PointKernel
Add best practices page to Dask cuDF docs (#16821) @rjzamora
[DOC] Update Pylibcudf doc strings (#16810) @Matt711
Recommending miniforge for conda install (#16782) @mmccarty
Add labeling pylibcudf doc pages (#16779) @mroeschke
Migrate dask-cudf README improvements to dask-cudf sphinx docs (#16765) @rjzamora
[DOC] Remove out of date section from cudf.pandas docs (#16697) @Matt711
Add performance tips to cudf.pandas FAQ. (#16693) @bdice
Update documentation for Dask cuDF (#16671) @rjzamora
Add missing pylibcudf strings docs (#16471) @brandon-b-miller
DOC: Refresh pylibcudf guide (#15856) @lithomas1

🚀 New Features

Build cudf-polars with build.sh (#16898) @brandon-b-miller
Add polars to "all" dependency list. (#16875) @bdice
nvCOMP GZIP integration (#16770) @vuule
[FEA] Add support for cudf.NamedAgg (#16744) @Matt711
Add experimental filesystem="arrow" support in dask_cudf.read_parquet (#16684) @rjzamora
Relax Arrow pin (#16681) @vyasr
Add libcudf wrappers around current_device_resource functions. (#16679) @harrism
Move NDS-H examples into benchmarks (#16663) @JayjeetAtGithub
[FEA] Add third-party library integration testing of cudf.pandas to cudf (#16645) @Matt711
Make isinstance check pass for proxy ndarrays (#16601) @Matt711
[FEA] Add an environment variable to fail on fallback in cudf.pandas (#16562) @Matt711
[FEA] Add support for cudf.unique (#16554) @Matt711
[FEA] Support named aggregations in df.groupby().agg() (#16528) @Matt711
Change IPv4 convert APIs to support UINT32 instead of INT64 (#16489) @davidwendt
enable list to be forced as string in JSON reader. (#16472) @karthikeyann
Remove cuDF dependency from pylibcudf column from_device tests (#16441) @brandon-b-miller
Enable cudf.pandas REPL and -c command support (#16428) @bdice
Setup pylibcudf package (#16299) @lithomas1
Add a libcudf/thrust-based TPC-H derived datagen (#16294) @JayjeetAtGithub
Make proxy NumPy arrays pass isinstance check in cudf.pandas (#16286) @Matt711
Add skiprows and nrows to parquet reader (#16214) @lithomas1
Upgrade to nvcomp 4.0.1 (#16076) @vuule
Migrate ORC reader to pylibcudf (#16042) @lithomas1
JSON reader validation of values (#15968) @karthikeyann
Implement exposed null mask APIs in pylibcudf (#15908) @charlesbluca
Word-based nvtext::minhash function (#15368) @davidwendt

🛠️ Improvements

Make tests deterministic (#16910) @galipremsagar
Update update-version.sh to use packaging lib (#16891) @AyodeAwe
Pin polars for 24.10 and update polars test suite xfail list (#16886) @wence-
Add in support for setting delim when parsing JSON through java (#16867) (#16880) @revans2
Remove unnecessary flag from build.sh (#16879) @vyasr
Ignore numba warning specific to ARM runners (#16872) @galipremsagar
Display deltas for cudf.pandas test summary (#16864) @galipremsagar
Switch to using native traceback (#16851) @galipremsagar
JSON tree algorithm code reorg (#16836) @karthikeyann
Add string.repeats API to pylibcudf (#16834) @mroeschke
Use CI workflow branch 'branch-24.10' again (#16832) @jameslamb
Rename the NDS-H benchmark binaries (#16831) @JayjeetAtGithub
Add string.findall APIs to pylibcudf (#16825) @mroeschke
Add string.extract APIs to pylibcudf (#16823) @mroeschke
use get-pr-info from nv-gha-runners (#16819) @AyodeAwe
Add string.contains APIs to pylibcudf (#16814) @mroeschke
Forward-merge branch-24.08 to branch-24.10 (#16813) @bdice
Add io_type axis with default `PINNED_BUFFER` to nvbench PQ multithreaded reader (#16809) @mhaseeb123
Update fmt (to 11.0.2) and spdlog (to 1.14.1). (#16806) @jameslamb
Add ability to set parquet row group max #rows and #bytes in java (#16805) @pmattione-nvidia
Add in option for Java JSON APIs to do column pruning in CUDF (#16796) @revans2
Support drop_first in get_dummies (#16795) @mroeschke
Exposed stream-ordering to join API (#16793) @lamarrr
Add string.attributes APIs to pylibcudf (#16785) @mroeschke
Java: Make ColumnVector.fromViewWithContiguousAllocation public (#16784) @jlowe
Add partitioning APIs to pylibcudf (#16781) @mroeschke
Optimization of tdigest merge aggregation. (#16780) @nvdbaranec
use libkvikio wheels in wheel builds (#16778) @jameslamb
Exposed stream-ordering to datetime API (#16774) @lamarrr
Add io/timezone APIs to pylibcudf (#16771) @mroeschke
Remove MultiIndex._poplevel inplace implementation. (#16767) @mroeschke
allow pandas patch version to float in cudf-pandas unit tests (#16763) @jameslamb
Simplify the nvCOMP adapter (#16762) @vuule
Add labeling APIs to pylibcudf (#16761) @mroeschke
Add transform APIs to pylibcudf (#16760) @mroeschke
Add a benchmark to study Parquet reader's performance for wide tables (#16751) @mhaseeb123
Change the Parquet writer's default_row_group_size_bytes from 128MB to inf (#16750) @mhaseeb123
Add transpose API to pylibcudf (#16749) @mroeschke
Add support for Python 3.12, update Kafka dependencies to 2.5.x (#16745) @jameslamb
Generate GPU vs CPU usage metrics per pytest file in pandas testsuite for cudf.pandas (#16739) @galipremsagar
Refactor cudf pandas integration tests CI (#16728) @Matt711
Remove ERROR_TEST gtest from libcudf (#16722) @davidwendt
Use Series._from_column more consistently to avoid validation (#16716) @mroeschke
remove some unnecessary libcudf nightly builds (#16714) @jameslamb
Remove xfail from torch-cudf.pandas integration test (#16705) @Matt711
Add return type annotations to MultiIndex (#16696) @mroeschke
Add type annotations to Index classes, utilize _from_column more (#16695) @mroeschke
Have interval_range use IntervalIndex.from_breaks, remove column_empty_same_mask (#16694) @mroeschke
Increase timeouts for couple of tests (#16692) @galipremsagar
Replace raw device_memory_resource pointer in pylibcudf Cython (#16674) @harrism
switch from typing.Callable to collections.abc.Callable (#16670) @jameslamb
Update rapidsai/pre-commit-hooks (#16669) @KyleFromNVIDIA
Multi-file and Parquet-aware prefetching from remote storage (#16657) @rjzamora
Access Frame attributes instead of ColumnAccessor attributes when available (#16652) @mroeschke
Use non-mangled type names in nvbench output (#16649) @davidwendt
Add pylibcudf build dir in build.sh for clean (#16648) @galipremsagar
Prune workflows based on changed files (#16642) @KyleFromNVIDIA
Remove arrow dependency (#16640) @vyasr
Support reading multiple PQ sources with mismatching nullability for columns (#16639) @mhaseeb123
Drop Python 3.9 support (#16637) @jameslamb
Support DecimalDtype meta in dask_cudf (#16634) @mroeschke
Add num_multiprocessors utility (#16628) @PointKernel
Annotate ColumnAccessor._data labels as Hashable (#16623) @mroeschke
Remove build_categorical_column in favor of CategoricalColumn constructor (#16617) @mroeschke
Move apply_boolean_mask benchmark to nvbench (#16616) @davidwendt
Revise get_reader_filepath_or_buffer to handle a list of data sources (#16613) @rjzamora
do not install cudf in cudf_polars wheel tests (#16612) @jameslamb
remove streamz git dependency, standardize build dependency names, consolidate some dependency lists (#16611) @jameslamb
Fix C++ and Cython io types (#16610) @vyasr
Remove arrow_io_source (#16607) @vyasr
Remove thrust::optional from expression evaluator (#16604) @bdice
Add stricter typing and validation to ColumnAccessor (#16602) @mroeschke
make more use of YAML anchors in dependencies.yaml (#16597) @jameslamb
Enable testing cudf.pandas unit tests for all minor versions of pandas (#16595) @galipremsagar
Extend the Parquet writer's dictionary encoding benchmark. (#16591) @mhaseeb123
Remove legacy Arrow interop APIs (#16590) @vyasr
Remove NativeFile support from cudf Python (#16589) @vyasr
Add build job for pylibcudf (#16587) @vyasr
Add public qualifier for some member functions in Java class Schema (#16583) @ttnghia
Enable gtests previously disabled for compute-sanitizer bug (#16581) @davidwendt
[FEA] Add filesystem argument to cudf.read_parquet (#16577) @rjzamora
Ensure size is always passed to NumericalColumn (#16576) @mroeschke
standardize and consolidate wheel installations in testing scripts (#16575) @jameslamb
Performance improvement for strings::slice for wide strings (#16574) @davidwendt
Add ToCudfBackend expression to dask-cudf (#16573) @rjzamora
CI: Test against old versions of key dependencies (#16570) @seberg
Replace NativeFile dependency in dask-cudf Parquet reader (#16569) @rjzamora
Align public utility function signatures with pandas 2.x (#16565) @mroeschke
Move libcudf reduction google-benchmarks to nvbench (#16564) @davidwendt
Rework strings::slice benchmark to use nvbench (#16563) @davidwendt
Reenable arrow tests (#16556) @vyasr
Clean up reshaping ops (#16553) @mroeschke
Disallow cudf.Index accepting column in favor of ._from_column (#16549) @mroeschke
Rewrite remaining Python Arrow interop conversions using the C Data Interface (#16548) @vyasr
[REVIEW] JSON host tree algorithms (#16545) @shrshi
Refactor dictionary encoding in PQ writer to migrate to the new cuco::static_map (#16541) @mhaseeb123
Remove hardcoded versions from workflows. (#16540) @bdice
Ensure comparisons with pyints and integer series always succeed (#16532) @seberg
Remove unneeded output size parameter from internal count_matches utility (#16531) @davidwendt
Remove invalid column_view usage in string-scalar-to-column function (#16530) @davidwendt
Raise NotImplementedError for Series.rename that's not a scalar (#16525) @mroeschke
Remove deprecated public APIs from libcudf (#16524) @davidwendt
Return Interval object in pandas compat mode for IntervalIndex reductions (#16523) @mroeschke
Update json normalization to take device_buffer (#16520) @karthikeyann
Rework cudf::io::text::byte_range_info class member functions (#16518) @davidwendt
Remove unneeded pair-iterator benchmark (#16511) @davidwendt
Update pre-commit hooks (#16510) @KyleFromNVIDIA
Improve update-version.sh (#16506) @bdice
Use tool.scikit-build.cmake.version, set scikit-build-core minimum-version (#16503) @jameslamb
Pass batch size to JSON reader using environment variable (#16502) @shrshi
Remove a deprecated multibyte_split API (#16501) @davidwendt
Add interop example for arrow::StringViewArray to cudf::column (#16498) @JayjeetAtGithub
Add keep option to distinct nvbench (#16497) @bdice
Use more idiomatic cudf APIs in dask_cudf meta generation (#16487) @mroeschke
Fix typo in dispatch_row_equal. (#16473) @bdice
Use explicit construction of column subclass instead of build_column when type is known (#16470) @mroeschke
Move exception handler into pylibcudf from cudf (#16468) @lithomas1
Make StructColumn.__init__ strict (#16467) @mroeschke
Make ListColumn.__init__ strict (#16465) @mroeschke
Make Timedelta/DatetimeColumn.__init__ strict (#16464) @mroeschke
Make NumericalColumn.__init__ strict (#16457) @mroeschke
Make CategoricalColumn.__init__ strict (#16456) @mroeschke
Disallow cudf.Series to accept column in favor of ._from_column (#16454) @mroeschke
Expose stream param in transform APIs (#16452) @JayjeetAtGithub
Add upper bound pin for polars (#16442) @wence-
Make (Indexed)Frame.__init__ require data (and index) (#16430) @mroeschke
Add Java APIs to copy column data to host asynchronously (#16429) @jlowe
Update docs of the TPC-H derived examples (#16423) @JayjeetAtGithub
Use RMM adaptor constructors instead of factories. (#16414) @bdice
Align ewm APIs with pandas 2.x (#16413) @mroeschke
Remove checking for specific tests in memcheck script (#16412) @davidwendt
Add stream parameter to reshape APIs (#16410) @davidwendt
Align groupby APIs with pandas 2.x (#16403) @mroeschke
Align misc DataFrame and MultiIndex methods with pandas 2.x (#16402) @mroeschke
update some branch references in GitHub Actions configs (#16397) @jameslamb
Support reading matching projected and filter cols from Parquet files with otherwise mismatched schemas (#16394) @mhaseeb123
Merge branch-24.08 into branch-24.10 (#16393) @jameslamb
Add query 10 to the TPC-H suite (#16392) @JayjeetAtGithub
Use make_host_vector instead of make_std_vector to facilitate pinned memory optimizations (#16386) @vuule
Fix some issues with deprecated / removed cccl facilities (#16377) @miscco
Align IntervalIndex APIs with pandas 2.x (#16371) @mroeschke
Align CategoricalIndex APIs with pandas 2.x (#16369) @mroeschke
Align TimedeltaIndex APIs with pandas 2.x (#16368) @mroeschke
Align DatetimeIndex APIs with pandas 2.x (#16367) @mroeschke
fix [tool.setuptools] reference in custreamz config (#16365) @jameslamb
Align Index APIs with pandas 2.x (#16361) @mroeschke
Rebuild for & Support NumPy 2 (#16300) @jakirkham
Add stream param to stream compaction APIs (#16295) @JayjeetAtGithub
Added batch memset to memset data and validity buffers in parquet reader (#16281) @sdrp713
Deduplicate decimal32/decimal64 to decimal128 conversion function (#16236) @mhaseeb123
Refactor mixed_semi_join using cuco::static_set (#16230) @srinivasyadav18
Improve performance of hash_character_ngrams using warp-per-string kernel (#16212) @davidwendt
Add environment variable to log cudf.pandas fallback calls (#16161) @mroeschke
Add libcudf example with large strings (#15983) @davidwendt
JSON tree algorithms refactor I: CSR data structure for column tree (#15979) @shrshi
Support multiple new-line characters in regex APIs (#15961) @davidwendt
adding wheel build for libcudf (#15483) @msarahan
Replace usages of thrust::optional with std::optional (#15091) @miscco

- C++
Published by raydouglass over 1 year ago

https://github.com/rapidsai/cudf - [NIGHTLY] v24.12.00

🚨 Breaking Changes

Move detail header floating_conversion.hpp to detail subdirectory (#17209) @davidwendt
Refactor Dask cuDF legacy code (#17205) @rjzamora
Make HostMemoryBuffer call into the DefaultHostMemoryAllocator (#17204) @revans2
Remove java reservation (#17189) @revans2
Upgrade to polars 1.11 in cudf-polars (#17154) @wence-
Remove the additional host register calls initially intended for performance improvement on Grace Hopper (#17092) @kingcrimsontianyu
Correctly set is_device_accessible when creating host_spans from other container/span types (#17079) @vuule
Unify treatment of Expr and IR nodes in cudf-polars DSL (#17016) @wence-
Deprecate support for directly accessing logger (#16964) @vyasr
Made cudftestutil header-only and removed GTest dependency (#16839) @lamarrr

🐛 Bug Fixes

Fix binop with LHS numpy datetimelike scalar (#17226) @mroeschke
Fix groupby.get_group with length-1 tuple with list-like grouper (#17216) @mroeschke
Fix discoverability of submodules inside pd.util (#17215) @galipremsagar
Fix Schema.Builder does not propagate precision value to Builder instance (#17214) @ttnghia
[BUG] Replace repo_token with github_token in Auto Assign PR GHA (#17203) @Matt711
Remove unsanitized nulls from input strings columns in reduction gtests (#17202) @davidwendt
Fix to_parquet append behavior with global metadata file (#17198) @rjzamora
Check num_children() == 0 in Column.from_column_view (#17193) @cwharris
Fix host-to-device copy missing sync in strings/duration convert (#17149) @davidwendt
Add JNI Support for Multi-line Delimiters and Include Test (#17139) @SurajAralihalli
Ignore loud dask warnings about legacy dataframe implementation (#17137) @galipremsagar
Fix the GDS read/write segfault/bus error when the cuFile policy is set to GDS or ALWAYS (#17122) @kingcrimsontianyu
Fix DataFrame._from_arrays and introduce validations (#17112) @galipremsagar
[Bug] Fix Arrow-FS parquet reader for larger files (#17099) @rjzamora
Fix bug in recovering invalid lines in JSONL inputs (#17098) @shrshi
Reenable huge pages for arrow host copying (#17097) @vyasr
Correctly set is_device_accessible when creating host_spans from other container/span types (#17079) @vuule
Fix ORC reader when using device_read_async while the destination device buffers are not ready (#17074) @ttnghia
Fix regex handling of fixed quantifier with 0 range (#17067) @davidwendt
Limit the number of keys to calculate column sizes and page starts in PQ reader to 1B (#17059) @mhaseeb123
Adding assertion to check for regular JSON inputs of size greater than INT_MAX bytes (#17057) @shrshi
bug fix: use self.ck_consumer in poll method of kafka.py to align with __init__ (#17044) @a-hirota
Disable kvikio remote I/O to avoid openssl dependencies in JNI build (#17026) @pxLi
Fix host_span constructor to correctly copy is_device_accessible (#17020) @vuule
Add pinning for pyarrow in wheels (#17018) @vyasr
Use std::optional for host types (#17015) @robertmaynard
Fix write_json to handle empty string column (#16995) @karthikeyann
Restore export of nvcomp outside of wheel builds (#16988) @KyleFromNVIDIA
Allow melt(var_name=) to be a falsy label (#16981) @mroeschke
Fix astype from tz-aware type to tz-aware type (#16980) @mroeschke
Use libcudf wheel from PR rather than nightly for polars-polars CI test job (#16975) @brandon-b-miller
Fix order-preservation in pandas-compat unsorted groupby (#16942) @wence-
Fix cudf::strings::findall error with empty input (#16928) @davidwendt
Fix JsonLargeReaderTest.MultiBatch use of LIBCUDF_JSON_BATCH_SIZE env var (#16927) @davidwendt
Parse newline as whitespace character while tokenizing JSONL inputs with non-newline delimiter (#16923) @shrshi
Respect groupby.nunique(dropna=False) (#16921) @mroeschke
Update all rmm imports to use pylibrmm/librmm (#16913) @Matt711
Fix order-preservation in cudf-polars groupby (#16907) @wence-
Add a shortcut for when the input clusters are all empty for the tdigest merge (#16897) @jihoonson
Properly handle the mapped and registered regions in memory_mapped_source (#16865) @vuule
Fix performance regression for generate_character_ngrams (#16849) @davidwendt
Fix regex parsing logic handling of nested quantifiers (#16798) @davidwendt
Compute whole column variance using numerically stable approach (#16448) @wence-

📖 Documentation

Fix some documentation rendering for pylibcudf (#17217) @mroeschke
Move detail header floating_conversion.hpp to detail subdirectory (#17209) @davidwendt
Add TokenizeVocabulary to api docs (#17208) @davidwendt
Add jaccard_index to generated cuDF docs (#17199) @davidwendt
[no ci] Add empty-columns section to the libcudf developer guide (#17183) @davidwendt
Add 2-cpp approvers text to contributing guide [no ci] @davidwendt
Changing developer guide int_64_t to int64_t (#17130) @hyperbolic2346
docs: change 'CSV' to 'csv' in python/custreamz/README.md to match kafka.py (#17041) @a-hirota
[DOC] Document limitation using cudf.pandas proxy arrays (#16955) @Matt711
[DOC] Document environment variable for failing on fallback in cudf.pandas (#16932) @Matt711

🚀 New Features

Upgrade nvcomp to 4.1.0.6 (#17201) @bdice
Support storing precision of decimal types in Schema class (#17176) @ttnghia
Add compute_shared_memory_aggs used by shared memory groupby (#17162) @PointKernel
Add compute_mapping_indices used by shared memory groupby (#17147) @PointKernel
Add remaining datetime APIs to pylibcudf (#17143) @Matt711
Added strings AST vs BINARY_OP benchmarks (#17128) @lamarrr
Include timezone file path in error message (#17102) @bdice
Migrate NVText Byte Pair Encoding APIs to pylibcudf (#17101) @Matt711
Migrate NVText Tokenizing APIs to pylibcudf (#17100) @Matt711
Migrate NVtext subword tokenizing APIs to pylibcudf (#17096) @Matt711
Migrate NVText Stemming APIs to pylibcudf (#17085) @Matt711
Migrate NVText Replacing APIs to pylibcudf (#17084) @Matt711
Migrate NVText Normalizing APIs to Pylibcudf (#17072) @Matt711
Migrate remaining nvtext NGrams APIs to pylibcudf (#17070) @Matt711
Add profilers to CUDA 12 conda devcontainers (#17066) @vyasr
Add conda recipe for cudf-polars (#17037) @bdice
Implement batch construction for strings columns (#17035) @ttnghia
Add device aggregators used by shared memory groupby (#17031) @PointKernel
Migrate Min Hashing APIs to pylibcudf (#17021) @Matt711
Reorganize cudf_polars expression code (#17014) @brandon-b-miller
Migrate nvtext jaccard API to pylibcudf (#17007) @Matt711
Migrate nvtext generate_ngrams APIs to pylibcudf (#17006) @Matt711
Control whether a file data source memory-maps the file with an environment variable (#17004) @vuule
Switched BINARY_OP Benchmarks from GoogleBench to NVBench (#16963) @lamarrr
[FEA] Migrate nvtext/edit_distance APIs to pylibcudf (#16957) @Matt711
Switched AST benchmarks from GoogleBench to NVBench (#16952) @lamarrr
Extend device_scalar to optionally use pinned bounce buffer (#16947) @vuule
Expose streams in public round APIs (#16925) @Matt711
Made cudftestutil header-only and removed GTest dependency (#16839) @lamarrr
Add an example to demonstrate multithreaded read_parquet pipelines (#16828) @mhaseeb123
Implement extract_datetime_component in libcudf/pylibcudf (#16776) @brandon-b-miller
Add cudf::strings::find_re API (#16742) @davidwendt
Migrate hashing operations to pylibcudf (#15418) @brandon-b-miller

🛠️ Improvements

Use more pylibcudf.io.types enums in cudf._libs (#17237) @mroeschke
Expose mixed and conditional joins in pylibcudf (#17235) @wence-
Add num_iterations axis to the multi-threaded Parquet benchmarks (#17231) @vuule
Support for polars 1.12 in cudf-polars (#17227) @wence-
Remove nvtext::load_vocabulary from pylibcudf (#17220) @Matt711
Expose stream-ordering in partitioning API (#17213) @shrshi
Move strings::concatenate benchmark to nvbench (#17211) @davidwendt
Expose stream-ordering in subword tokenizer API (#17206) @shrshi
Refactor Dask cuDF legacy code (#17205) @rjzamora
Make HostMemoryBuffer call into the DefaultHostMemoryAllocator (#17204) @revans2
Unified binary_ops and ast benchmarks parameter names (#17200) @lamarrr
Add in new java API for raw host memory allocation (#17197) @revans2
Remove java reservation (#17189) @revans2
Fixed unused attribute compilation error for GCC 13 (#17188) @lamarrr
Change default KvikIO parameters in cuDF: set the thread pool size to 4, and compatibility mode to ON (#17185) @kingcrimsontianyu
Use make_device_uvector instead of cudaMemcpyAsync in inplace_bitmask_binop (#17181) @davidwendt
Make ai.rapids.cudf.HostMemoryBuffer#copyFromStream public. (#17179) @liurenjie1024
Move nvtext ngrams benchmarks to nvbench (#17173) @davidwendt
Remove includes suggested by include-what-you-use (#17170) @vyasr
Upgrade to polars 1.11 in cudf-polars (#17154) @wence-
Deprecate current libcudf nvtext minhash functions (#17152) @davidwendt
Remove unused variable in internal merge_tdigests utility (#17151) @davidwendt
Use the full ref name of rmm.DeviceBuffer in the sphinx config file (#17150) @Matt711
Move segmented_gather function from the copying module to the lists module (#17148) @Matt711
Use async execution policy for true_if (#17146) @PointKernel
Add conversion from cudf-polars expressions to libcudf ast for parquet filters (#17141) @wence-
devcontainer: replace VAULT_HOST with AWS_ROLE_ARN (#17134) @jjacobelli
Replace direct cudaMemcpyAsync calls with utility functions (limited to cudf::io) (#17132) @vuule
use rapids-generate-pip-constraints to pin to oldest dependencies in CI (#17131) @jameslamb
Set the default number of threads in KvikIO thread pool to 8 (#17126) @kingcrimsontianyu
Fix clang-tidy violations for span.hpp and hostdevice_vector.hpp (#17124) @davidwendt
Disable the Parquet reader's wide lists tables GTest by default (#17120) @mhaseeb123
Add compile time check to ensure the counting_iterator type in counting_transform_iterator fits in size_type (#17118) @mhaseeb123
Minor I/O code quality improvements (#17105) @kingcrimsontianyu
Remove the additional host register calls initially intended for performance improvement on Grace Hopper (#17092) @kingcrimsontianyu
Split hash-based groupby into multiple smaller files to reduce build time (#17089) @PointKernel
build wheels without build isolation (#17088) @jameslamb
Remove unused hash helper functions (#17056) @PointKernel
Add to_dlpack/from_dlpack APIs to pylibcudf (#17055) @mroeschke
Move flatten_single_pass_aggs to its own TU (#17053) @PointKernel
Replace deprecated cuco APIs with updated versions (#17052) @PointKernel
Refactor ORC dictionary encoding to migrate to the new cuco::static_map (#17049) @mhaseeb123
Move pylibcudf/libcudf/wrappers/decimals to pylibcudf/libcudf/fixed_point (#17048) @mroeschke
make conda installs in CI stricter (part 2) (#17042) @jameslamb
Use managed memory for NDSH benchmarks (#17039) @karthikeyann
Clean up hash-groupby var_hash_functor (#17034) @PointKernel
Add json APIs to pylibcudf (#17025) @mroeschke
Add string.replace_re APIs to pylibcudf (#17023) @mroeschke
Replace old host tree algorithm with new algorithm in JSON reader (#17019) @karthikeyann
Unify treatment of Expr and IR nodes in cudf-polars DSL (#17016) @wence-
make conda installs in CI stricter (#17013) @jameslamb
Pylibcudf: pack and unpack (#17012) @madsbk
Remove unneeded pylibcudf.libcudf.wrappers.duration usage in cudf (#17010) @mroeschke
Add custom "fused" groupby aggregation to Dask cuDF (#17009) @rjzamora
Make tests more deterministic (#17008) @galipremsagar
Remove unused import (#17005) @Matt711
Add string.convert.convert_urls APIs to pylibcudf (#17003) @mroeschke
Add release tracking to project automation scripts (#17001) @jarmak-nv
Add string.convert.convert_lists APIs to pylibcudf (#16997) @mroeschke
Performance optimization of JSON validation (#16996) @karthikeyann
Add string.convert.convert_ipv4 APIs to pylibcudf (#16994) @mroeschke
Add string.convert.convert_integers APIs to pylibcudf (#16991) @mroeschke
Add string.convert_floats APIs to pylibcudf (#16990) @mroeschke
Add string.convert.convert_fixed_type APIs to pylibcudf (#16984) @mroeschke
Remove unnecessary std::move's in pylibcudf (#16983) @Matt711
Add docstrings and test for strings.convert_durations APIs for pylibcudf (#16982) @mroeschke
JSON tokenizer memory optimizations (#16978) @shrshi
Turn on xfail_strict = true for all python packages (#16977) @wence-
Add string.convert.convert_datetime/convert_booleans APIs to pylibcudf (#16971) @mroeschke
Auto assign PR to author (#16969) @Matt711
Deprecate support for directly accessing logger (#16964) @vyasr
Expunge NamedColumn (#16962) @wence-
Add clang-tidy to CI (#16958) @vyasr
Address all remaining clang-tidy errors (#16956) @vyasr
Apply clang-tidy autofixes (#16949) @vyasr
Use nvcomp wheel instead of bundling nvcomp (#16946) @KyleFromNVIDIA
Refactor the cuda_memcpy functions to make them more usable (#16945) @vuule
Add string.split APIs to pylibcudf (#16940) @mroeschke
clang-tidy fixes part 3 (#16939) @vyasr
clang-tidy fixes part 2 (#16938) @vyasr
clang-tidy fixes part 1 (#16937) @vyasr
Add string.wrap APIs to pylibcudf (#16935) @mroeschke
Add string.translate APIs to pylibcudf (#16934) @mroeschke
Add string.find_multiple APIs to pylibcudf (#16920) @mroeschke
Batch memcpy the last offsets for output buffers of str and list cols in PQ reader (#16905) @mhaseeb123
reduce wheel build verbosity, narrow deprecation warning filter (#16896) @jameslamb
Improve aggregation device functors (#16884) @PointKernel
Upgrade pandas pinnings to support 2.2.3 (#16882) @galipremsagar
Fix 24.10 to 24.12 forward merge (#16876) @bdice
Manually resolve conflicts in between branch-24.12 and branch-24.10 (#16871) @galipremsagar
Add in support for setting delim when parsing JSON through java (#16867) @revans2
Reapply mixed_semi_join refactoring and bug fixes (#16859) @mhaseeb123
Add string padding and side_type APIs to pylibcudf (#16833) @mroeschke
Organize parquet reader mukernel non-nullable code, introduce manual block scans (#16830) @pmattione-nvidia
Remove superfluous use of std::vector for std::future (#16829) @kingcrimsontianyu
Rework read_csv IO to avoid reading whole input with a single host_read (#16826) @vuule
Add strings.combine APIs to pylibcudf (#16790) @mroeschke
Add remaining string.char_types APIs to pylibcudf (#16788) @mroeschke
Avoid public constructors when called with columns to avoid unnecessary validation (#16747) @mroeschke
Use changed-files shared workflow (#16713) @KyleFromNVIDIA
lint: replace isort with Ruff's rule I (#16685) @Borda
Parquet reader list microkernel (#16538) @pmattione-nvidia
Refactor histogram reduction using cuco::static_set::insert_and_find (#16485) @srinivasyadav18
Use numba-cuda>=0.0.13 (#16474) @gmarkall

- C++
Published by rapids-bot[bot] over 1 year ago

https://github.com/rapidsai/cudf - [NIGHTLY] v24.08.00

🚨 Breaking Changes

Align Index init APIs with pandas 2.x (#16362) @mroeschke
Align Series APIs with pandas 2.x (#16333) @mroeschke
Add missing stream param to dictionary factory APIs (#16319) @JayjeetAtGithub
Deprecate dtype= parameter in reduction methods (#16313) @mroeschke
Remove squeeze argument from groupby (#16312) @mroeschke
Align more DataFrame APIs with pandas (#16310) @mroeschke
Remove mr param from write_csv and write_json (#16231) @JayjeetAtGithub
Report number of rows per file read by PQ reader when no row selection and fix segfault in chunked PQ reader when skip_rows > 0 (#16195) @mhaseeb123
Refactor from_arrow_device/host to use resource_ref (#16160) @harrism
Deprecate Arrow support in I/O (#16132) @lithomas1
Return FrozenList for Index.names (#16047) @galipremsagar
Add compile option to enable large strings support (#16037) @davidwendt
Hide visibility of non public symbols (#15982) @robertmaynard
Rename strings multiple target replace API (#15898) @davidwendt
Pinned vector factory that uses the global pool (#15895) @vuule
Apply clang-tidy autofixes (#15894) @vyasr
Support arrow:schema in Parquet writer to faithfully roundtrip duration types with Arrow (#15875) @mhaseeb123
Expose stream parameter to public rolling APIs (#15865) @srinivasyadav18
Fix large strings handling in nvtext::character_tokenize (#15829) @davidwendt
Remove legacy JSON reader and concurrent_unordered_map.cuh. (#15813) @bdice

🐛 Bug Fixes

Ensure managed memory is supported in cudf.pandas. (#16552) @bdice
Add flatbuffers to libcudf build (#16446) @galipremsagar
Fix parquet_field_list read_func lambda capture invalid this pointer (#16440) @davidwendt
Enable prefetching in cudf.pandas.install() (#16439) @bdice
Enable prefetching before runpy (#16427) @galipremsagar
Support thread-safe for prefetch_config::get and prefetch_config::set (#16425) @ttnghia
Fix a pandas-2.0 missing attribute error (#16416) @galipremsagar
[Bug] Remove loud NativeFile deprecation noise for read_parquet from S3 (#16415) @rjzamora
Fix nightly memcheck error for empty STREAM_INTEROP_TEST (#16406) @davidwendt
Gate ArrowStringArrayNumpySemantics cudf.pandas proxy behind version check (#16401) @mroeschke
Don't export bs_thread_pool (#16398) @KyleFromNVIDIA
Require fixed width types for casting in cudf-polars (#16381) @brandon-b-miller
Fix docstring of DataFrame.apply (#16351) @galipremsagar
Make __bool__ raise for more cudf objects (#16311) @mroeschke
Rename .devcontainers for CUDA 12.5 (#16293) @jakirkham
Fix split_record for all empty strings column (#16291) @davidwendt
Fix logic in to_arrow for empty list column (#16279) @wence-
[BUG] Make name attr of Index fast slow attrs (#16270) @Matt711
Add custom name setter and getter for proxy objects in cudf.pandas (#16234) @Matt711
Fall back when casting a timestamp to numeric in cudf-polars (#16232) @brandon-b-miller
Disable large string support for Java build (#16216) @jlowe
Remove CCCL patch for PR 211. (#16207) @bdice
Add single offset to an empty ListArray in cudf::to_arrow (#16201) @davidwendt
Fix memory_usage when calculating nested list column (#16193) @mroeschke
Support at/iat indexers in cudf.pandas (#16177) @mroeschke
Fix unused-return-value debug build error in from_arrow_stream_test.cpp (#16168) @davidwendt
Fix cudf::strings::replace_multiple hang on empty target (#16167) @davidwendt
Refactor from_arrow_device/host to use resource_ref (#16160) @harrism
interpolate returns new column if no values are interpolated (#16158) @mroeschke
Use provided memory resource for allocating mixed join results. (#16153) @bdice
Run DFG after verify-alpha-spec (#16151) @KyleFromNVIDIA
Use size_t to allow large conditional joins (#16127) @bdice
Allow only scale=0 fixed-point values in fixed_width_column_wrapper (#16120) @davidwendt
Fix pylibcudf Table.num_rows for 0 columns case and add interop to docs (#16108) @lithomas1
Add support for proxy np.flatiter objects (#16107) @Matt711
Ensure cudf objects can astype to any type when empty (#16106) @mroeschke
Support pd.read_pickle and pd.to_pickle in cudf.pandas (#16105) @Matt711
Fix unnecessarily strict check in parquet chunked reader for choosing split locations. (#16099) @nvdbaranec
Fix is_monotonic_* APIs to include nan's (#16085) @galipremsagar
More safely parse CUDA versions when subprocess output is contaminated (#16067) @brandon-b-miller
fast_slow_proxy: Don't import assert_eq at top-level (#16063) @wence-
Prevent bad ColumnAccessor state after .sort_index(axis=1, ignore_index=True) (#16061) @mroeschke
Fix ArrowDeviceArray interface to pass address of event (#16058) @zeroshade
Fix a size overflow bug in hash groupby (#16053) @PointKernel
Fix atomic_ref scope when multiple blocks are updating the same output (#16051) @vuule
Fix initialization error in to_arrow for empty string views (#16033) @wence-
Fix the int32 overflow when computing page fragment sizes for large string columns (#16028) @mhaseeb123
Fix the pool size alignment issue (#16024) @PointKernel
Improve multibyte-split byte-range performance (#16019) @davidwendt
Fix target counting in strings char-parallel replace (#16017) @davidwendt
Support IntervalDtype in cudf.from_pandas (#16014) @mroeschke
Fix memory size in create_byte_range_infos_consecutive (#16012) @davidwendt
Hide visibility of non public symbols (#15982) @robertmaynard
Fix Cython typo preventing proper inheritance (#15978) @vyasr
Fix convert_dtypes with convert_integer=False/convert_floating=True (#15964) @mroeschke
Fix nunique for MultiIndex, DataFrame, and all NA case with dropna=False (#15962) @mroeschke
Explicitly build for all GPU architectures (#15959) @vyasr
Preserve column type and class information in more DataFrame operations (#15949) @mroeschke
Add __array_interface__ to cudf.pandas numpy.ndarray proxy (#15936) @mroeschke
Allow tests to be built when stream util is disabled (#15933) @robertmaynard
Fix JSON multi-source reading when total source size exceeds INT_MAX bytes (#15930) @shrshi
Fix dask_cudf.read_parquet regression for legacy timestamp data (#15929) @rjzamora
Fix offsetalator when accessing over 268 million rows (#15921) @davidwendt
Fix debug assert in rowgroup_char_counts_kernel (#15902) @davidwendt
Fix categorical conversion from chunked arrow arrays (#15886) @vyasr
Handling for NaN and inf when converting floating point to fixed point types (#15885) @ttnghia
Manual merge of Branch 24.08 from 24.06 (#15869) @galipremsagar
Avoid unnecessary Index cast in IndexedFrame.index setter (#15843) @charlesbluca
Fix large strings handling in nvtext::character_tokenize (#15829) @davidwendt
Fix multi-replace target count logic for large strings (#15807) @davidwendt
Fix JSON parsing memory corruption - Fix Mixed types nested children removal (#15798) @karthikeyann
Allow anonymous user in devcontainer name. (#15784) @bdice
Add support for additional metaclasses of proxies and use for ExcelWriter (#15399) @vyasr

📖 Documentation

Improve Polars docs (#16820) @bdice
Add docstring for from_dataframe (#16260) @mroeschke
Update libcudf compiler requirements in contributing doc (#16103) @davidwendt
Add libcudf public/detail API pattern to developer guide (#16086) @davidwendt
Explain line profiler and how to know which functions are GPU-accelerated. (#16079) @bdice
cudf.pandas documentation improvement (#15948) @Matt711
Reland "Fix docs for IO readers and strings_convert (#15872)" (#15941) @lithomas1
Document how to use cudf.pandas in tandem with multiprocessing (#15940) @wence-
DOC: Add documentation for cudf.pandas in the Developer Guide (#15889) @Matt711
Improve options docs (#15888) @bdice
DOC: add linkcode to docs (#15860) @raybellwaves
DOC: use intersphinx mapping in pandas-compat ext (#15846) @raybellwaves
Fix inconsistent usage of 'results' and 'records' in read-json.md (#15766) @dagardner-nv
Update PandasCompat.py to resolve references (#15704) @raybellwaves

🚀 New Features

Creation of CI artifacts for cudf-polars wheels (#16680) @wence-
Warn on cuDF failure when POLARS_VERBOSE is true (#16308) @brandon-b-miller
Add drop_nulls in cudf-polars (#16290) @brandon-b-miller
[JNI] Add setKernelPinnedCopyThreshold and setPinnedAllocationThreshold (#16288) @abellina
Implement support for scan_ndjson in cudf-polars (#16263) @lithomas1
Publish cudf-polars nightlies (#16213) @lithomas1
Modify make_host_vector and make_device_uvector factories to optionally use pinned memory and kernel copy (#16206) @vuule
Migrate lists/set_operations to pylibcudf (#16190) @Matt711
Migrate lists/filling to pylibcudf (#16189) @Matt711
Fall back to CPU for unsupported libcudf binaryops in cudf-polars (#16188) @brandon-b-miller
Use resource_ref for upstream in stream_checking_resource_adaptor (#16187) @harrism
Migrate lists/modifying to pylibcudf (#16185) @Matt711
Migrate lists/filtering to pylibcudf (#16184) @Matt711
Migrate lists/sorting to pylibcudf (#16179) @Matt711
Add missing methods to lists/list_column_view.pxd in pylibcudf (#16175) @Matt711
Migrate pylibcudf lists gathering (#16170) @Matt711
Move kernel vis over to CUDF_HIDDEN (#16165) @robertmaynard
Add groupby_max multi-threaded benchmark (#16154) @srinivasyadav18
Promote has_nested_columns to cudf public API (#16131) @robertmaynard
Promote IO support queries to cudf API (#16125) @robertmaynard
cudf::merge public API now supports passing a user stream (#16124) @robertmaynard
Add TPC-H inspired examples for Libcudf (#16088) @JayjeetAtGithub
Installed cudf header use cudf::allocate_like (#16087) @robertmaynard
cudf-polars string slicing (#16082) @brandon-b-miller
Migrate Parquet reader to pylibcudf (#16078) @lithomas1
Migrate lists/count_elements to pylibcudf (#16072) @Matt711
Migrate lists/extract to pylibcudf (#16071) @Matt711
Move common string utilities to public api (#16070) @robertmaynard
stable_distinct public api now has a stream parameter (#16068) @robertmaynard
Migrate expressions to pylibcudf (#16056) @lithomas1
Add support to ArrowDataSource in SourceInfo (#16050) @lithomas1
Experimental support for configurable prefetching (#16020) @vyasr
Migrate CSV reader to pylibcudf (#16011) @lithomas1
Migrate string slice APIs to pylibcudf (#15988) @brandon-b-miller
Migrate lists/contains to pylibcudf (#15981) @Matt711
Remove CCCL 2.2 patches as we now always use 2.5+ (#15969) @robertmaynard
Migrate JSON reader to pylibcudf (#15966) @lithomas1
Add a developer check for proxy objects (#15956) @Matt711
Start migrating I/O writers to pylibcudf (starting with JSON) (#15952) @lithomas1
Kernel copy for pinned memory (#15934) @vuule
Migrate left join and conditional join benchmarks to use nvbench (#15931) @srinivasyadav18
Migrate lists/combine to pylibcudf (#15928) @Matt711
Plumb pylibcudf strings contains_re through cudf_polars (#15918) @brandon-b-miller
Start migrating I/O to pylibcudf (#15899) @lithomas1
Pinned vector factory that uses the global pool (#15895) @vuule
Migrate strings contains operations to pylibcudf (#15880) @brandon-b-miller
Migrate quantile.pxd to pylibcudf (#15874) @lithomas1
Migrate round to pylibcudf (#15863) @lithomas1
Migrate string replace.pxd to pylibcudf (#15839) @lithomas1
Add an Environment Variable for debugging the fast path in cudf.pandas (#15837) @Matt711
Add an option to run cuIO benchmarks with pinned buffers as input (#15830) @vuule
Update pylibcudf testing utilities (#15772) @brandon-b-miller
Migrate string capitalize APIs to pylibcudf (#15503) @brandon-b-miller
Add tests for pylibcudf binaryops (#15470) @brandon-b-miller
Migrate column factories to pylibcudf (#15257) @brandon-b-miller
cuDF/libcudf exponentially weighted moving averages (#9027) @brandon-b-miller

🛠️ Improvements

Ensure objects with __interface__ are converted to cupy/numpy arrays (#16436) @mroeschke
Add about rmm modes in cudf.pandas docs (#16404) @galipremsagar
Gracefully CUDF_FAIL when `skip_rows > 0` in Chunked Parquet reader (#16385) @mhaseeb123
Make C++ compilation warning free after #16297 (#16379) @wence-
Align Index init APIs with pandas 2.x (#16362) @mroeschke
Use rapids_cpm_bs_thread_pool() (#16360) @KyleFromNVIDIA
Rename PrefetchConfig to prefetch_config. (#16358) @bdice
Implement parquet reading using pylibcudf in cudf-polars (#16346) @lithomas1
Fix compile warnings with jni_utils.hpp (#16336) @ttnghia
Align Series APIs with pandas 2.x (#16333) @mroeschke
Add missing stream param to dictionary factory APIs (#16319) @JayjeetAtGithub
Mark cudf._typing as a typing module in ruff (#16318) @mroeschke
Add stream param to list explode APIs (#16317) @JayjeetAtGithub
Fix polars for 1.2.1 (#16316) @lithomas1
Use workflow branch 24.08 again (#16314) @KyleFromNVIDIA
Deprecate dtype= parameter in reduction methods (#16313) @mroeschke
Remove squeeze argument from groupby (#16312) @mroeschke
Align more DataFrame APIs with pandas (#16310) @mroeschke
Clean unneeded/redundant dtype utils (#16309) @mroeschke
Implement read_csv in cudf-polars using pylibcudf (#16307) @lithomas1
Use Column.can_cast_safely instead of some ad-hoc dtype functions in .where (#16303) @mroeschke
Drop {{ pin_compatible('numpy', max_pin='x') }} (#16301) @jakirkham
Host implementation of to_arrow using nanoarrow (#16297) @zeroshade
Add ability to prefetch in cudf.pandas and change default to managed pool (#16296) @galipremsagar
Fix tests for polars 1.2 (#16292) @lithomas1
Introduce dedicated options for low memory readers (#16289) @galipremsagar
Remove decimal/floating 64/128bit switches due to register pressure (#16287) @pmattione-nvidia
Make ColumnAccessor strictly require a mapping of columns (#16285) @mroeschke
Introduce version file so we can conditionally handle things in tests (#16280) @wence-
Type & reduce cupy usage (#16277) @mroeschke
Update cudf::detail::grid_1d to use thread_index_type (#16276) @davidwendt
Replace np.isscalar/issubdtype checks with is_scalar/.kind checks (#16275) @mroeschke
Remove xml from sort_ninja_log.py utility (#16274) @davidwendt
Fix issue in horizontal concat implementation in cudf-polars (#16271) @wence-
Preserve order in left join for cudf-polars (#16268) @wence-
Replace is_datetime/timedelta_dtype checks with .kind checks (#16262) @mroeschke
Replace is_float/integer_dtype checks with .kind checks (#16261) @mroeschke
Build and test with CUDA 12.5.1 (#16259) @KyleFromNVIDIA
Replace is_bool_type with checking .dtype.kind (#16255) @mroeschke
remove cuco_noexcept.diff (#16254) @trxcllnt
Update contains_tests.cpp to use public cudf::slice (#16253) @davidwendt
Improve the test data for pylibcudf I/O tests (#16247) @lithomas1
Short circuit some Column methods (#16246) @mroeschke
Make nvcomp adapter compatible with new version macros (#16245) @vuule
Add Column.strftime/strptime instead of overloading as_string/datetime/timedelta_column (#16243) @mroeschke
Remove temporary functor overloads required by cuco version bump (#16242) @PointKernel
Remove hash_character_ngrams dependency from jaccard_index (#16241) @davidwendt
Expose sorted groupby parameters to pylibcudf (#16240) @wence-
Expose reflection to check if casting between two types is supported (#16239) @wence-
Handle nans in groupby-aggregations in polars executor (#16233) @wence-
Remove mr param from write_csv and write_json (#16231) @JayjeetAtGithub
Support Literals in groupby-agg (#16218) @wence-
Handle csv reader options in cudf-polars (#16211) @wence-
Update vendored thread_pool implementation (#16210) @wence-
Add low memory JSON reader for cudf.pandas (#16204) @galipremsagar
Clean up state variables in MultiIndex (#16203) @mroeschke
skip CMake 3.30.0 (#16202) @jameslamb
Assert valid metadata is passed in to_arrow for list_view (#16198) @wence-
Expose type traits to pylibcudf (#16197) @wence-
Report number of rows per file read by PQ reader when no row selection and fix segfault in chunked PQ reader when skip_rows > 0 (#16195) @mhaseeb123
Cast count aggs to correct dtype in translation (#16192) @wence-
Some small fixes in cudf-polars (#16191) @wence-
split up CUDA-suffixed dependencies in dependencies.yaml (#16183) @jameslamb
Define PTDS for the stream hook libs (#16182) @trxcllnt
Make test_python_cudf_pandas generate requirements.txt (#16181) @trxcllnt
Add environment-agnostic ci/run_cudf_polars_pytest.sh (#16178) @trxcllnt
Implement translation for some unary functions and a single datetime extraction (#16173) @wence-
Remove size constraints on source files in batched JSON reading (#16162) @shrshi
CI: Build wheels for cudf-polars (#16156) @lithomas1
Update cudf-polars for v1 release of polars (#16149) @wence-
Use strings concatenate to support large strings in CSV writer (#16148) @davidwendt
Use verify-alpha-spec hook (#16144) @KyleFromNVIDIA
Adds write-coalescing code path optimization to FST (#16143) @elstehle
MAINT: Adapt to NumPy 2 promotion changes (#16141) @seberg
API: Check for integer overflows when creating scalar from python int (#16140) @seberg
Remove the (unused) implementation of host_parse_nested_json (#16135) @vuule
Deprecate Arrow support in I/O (#16132) @lithomas1
Disable dict support for split-page kernel in the parquet reader. (#16128) @nvdbaranec
Add throughput metrics for REDUCTION_BENCH/REDUCTION_NVBENCH benchmarks (#16126) @jihoonson
Add ensure_index to not unnecessarily shallow copy cudf.Index (#16117) @mroeschke
Make binary operators work between fixed-point and floating args (#16116) @pmattione-nvidia
Implement Ternary copy_if_else (#16114) @wence-
Implement handlers for series literal in cudf-polars (#16113) @wence-
Fix dtype errors in StringArrays (#16111) @galipremsagar
Ensure MultiIndex.to_frame deep copies columns (#16110) @mroeschke
Parallelize gpuInitStringDescriptors for fixed length byte array data (#16109) @mhaseeb123
Finish implementation of cudf-polars boolean function handlers (#16098) @wence-
Expose and then implement support for cross joins in cudf-polars (#16097) @wence-
Defer copying in Column.astype(copy=True) (#16095) @mroeschke
Fix segfault in conditional join (#16094) @bdice
Free temp memory no longer needed in multibyte_split processing (#16091) @davidwendt
Rename gather/scatter benchmarks to clarify coalesced behavior. (#16083) @bdice
Adapt to polars upstream changes and turn on CI testing (#16081) @wence-
Reduce/clean copy usage in Series, reshaping (#16080) @mroeschke
Account for FIXED_LEN_BYTE_ARRAY when calculating fragment sizes in Parquet writer (#16064) @etseidl
Reduce (shallow) copies in DataFrame ops (#16060) @mroeschke
Add multi-file support to dask_cudf.read_json (#16057) @rjzamora
Reduce deep copies in Index ops (#16054) @mroeschke
Implement chunked column wise concat in chunked parquet reader (#16052) @galipremsagar
Add exception when trying to create large strings with cudf::test::strings_column_wrapper (#16049) @davidwendt
Return FrozenList for Index.names (#16047) @galipremsagar
Add ast cast test (#16045) @pmattione-nvidia
Remove override_dtypes and include_index from Frame._copy_type_metadata (#16043) @mroeschke
Add ruff rules to avoid importing from typing (#16040) @mroeschke
Fix decimal -> float cast in ast code (#16038) @pmattione-nvidia
Add compile option to enable large strings support (#16037) @davidwendt
Reduce conditional_join nvbench configurations (#16036) @srinivasyadav18
Project automation update: skip if not in project (#16035) @jarmak-nv
Add stream parameter to cudf::io::text::multibyte_split (#16034) @davidwendt
Delete unused code from stringfunction evaluator (#16032) @wence-
Fix exclude regex in pre-commit clang-format hook (#16030) @wence-
Refactor rmm usage in cudf.pandas (#16021) @galipremsagar
Enable ruff TCH: typing imports under if TYPE_CHECKING (#16015) @mroeschke
Restrict the allowed pandas timezone objects in cudf (#16013) @mroeschke
orc multithreaded benchmark (#16009) @zpuller
Add tests of expression-based sort and sort-by (#16008) @wence-
Add tests of implemented StringFunctions (#16007) @wence-
Add test that diagonal concat with mismatching schemas raises (#16006) @wence-
Add coverage selecting len from a dataframe (number of rows) (#16005) @wence-
Add basic tests of dataframe scan (#16003) @wence-
Add coverage for both expression and dataframe filter (#16002) @wence-
Remove deprecated ExtContext node (#16001) @wence-
Fix typo bug in gather implementation (#16000) @wence-
Extend coverage of groupby and rolling window nodes (#15999) @wence-
Coverage of binops where one or both operands are a scalar (#15998) @wence-
Add full coverage for whole-frame Agg expressions (#15997) @wence-
Add tests covering magic methods of Expr objects (#15996) @wence-
Add full coverage of utility functions (#15995) @wence-
Test behaviour of containers (#15994) @wence-
Fix implementation of any, all, and isbetween (#15993) @wence-
Raise early on unhandled PythonScan node (#15992) @wence-
Remove mapfunction nodes that don't exist/aren't supported (#15991) @wence-
Add test coverage for slicing with "out of bounds" negative indices (#15990) @wence-
Standardize and type Series.dt methods (#15987) @mroeschke
Refactor distinct with hashset-based algorithms (#15984) @srinivasyadav18
resolve dependency-file-generator warning, remove unnecessary rapids-build-backend configuration (#15980) @jameslamb
Project automation bug fixes (#15971) @jarmak-nv
Add typing to single_column_frame (#15965) @mroeschke
Move some misc Frame methods to appropriate locations (#15963) @mroeschke
Condense pylibcudf data fixtures (#15958) @lithomas1
Refactor fillna logic to push specifics toward Frame subclasses and Column subclasses (#15957) @mroeschke
Remove unused parsing utilities (#15955) @vuule
Remove Scalar container type from polars interpreter (#15953) @wence-
Support arbitrary CUDA versions in UDF code (#15950) @bdice
Support large strings in cudf::io::text::multibyte_split (#15947) @davidwendt
Add external issue label and project automation (#15945) @jarmak-nv
Enable round-tripping of large strings in cudf (#15944) @galipremsagar
Add more complete type annotations in polars interpreter (#15942) @wence-
Update implementations to build with the latest cuco (#15938) @PointKernel
Support timezone aware pandas inputs in cudf (#15935) @mroeschke
Define Column.nan_as_null to return self (#15923) @mroeschke
Make Frame._dtype an iterator instead of a dict (#15920) @mroeschke
Port start of datetime.hpp to pylibcudf (#15916) @wence-
Introduce NamedColumn concept in cudf-polars (#15914) @wence-
Avoid redefining Frame._get_columns_by_label in subclasses (#15912) @mroeschke
Templatization of fixed-width parquet decoding kernels. (#15911) @nvdbaranec
New Decimal <--> Floating conversion (#15905) @pmattione-nvidia
Use Arrow C Data Interface functions for Python interop (#15904) @vyasr
Use offsetalator in cudf::io::json::detail::parse_string (#15900) @davidwendt
Rename strings multiple target replace API (#15898) @davidwendt
Apply clang-tidy autofixes (#15894) @vyasr
Update Python labels and remove unnecessary ones (#15893) @vyasr
Clean up pylibcudf test assertions (#15892) @lithomas1
Use offsetalator in orc rowgroup_char_counts_kernel (#15891) @davidwendt
Ensure literals have correct dtype (#15890) @wence-
Add overflow check when converting large strings to lists columns (#15887) @davidwendt
Use offsetalator in nvtext::tokenize_with_vocabulary (#15878) @davidwendt
Update interleave lists column for large strings (#15877) @davidwendt
Simple NumPy 2 fixes that are clearly no behavior change (#15876) @seberg
Support arrow:schema in Parquet writer to faithfully roundtrip duration types with Arrow (#15875) @mhaseeb123
Refactor join benchmarks to target public APIs with the default stream (#15873) @PointKernel
Fix url-decode benchmark to use offsetalator (#15871) @davidwendt
Use offsetalator in strings shift functor (#15870) @davidwendt
Memory Profiling (#15866) @madsbk
Expose stream parameter to public rolling APIs (#15865) @srinivasyadav18
Make Frame.astype return Self instead of a ColumnAccessor (#15861) @mroeschke
Use ColumnAccessor row and column length attributes more consistently (#15857) @mroeschke
add unit test setup for cudf_kafka (#15853) @jameslamb
Remove internal usage of core.index.as_index in favor of cudf.Index (#15851) @mroeschke
Ensure cudf.Series(cudf.Series(...)) creates a reference to the same index (#15845) @mroeschke
Remove benchmark-specific use of pinned-pooled memory in Parquet multithreaded benchmark. (#15838) @nvdbaranec
Implement on_bad_lines in json reader (#15834) @galipremsagar
Make Column.to_pandas return Index instead of Series (#15833) @mroeschke
Add test of interoperability of cuDF and arrow BYTE_STREAM_SPLIT encoders (#15832) @etseidl
Refactor Parquet writer options and builders (#15831) @etseidl
Migrate reshape.pxd to pylibcudf (#15827) @lithomas1
Remove legacy JSON reader and concurrent_unordered_map.cuh. (#15813) @bdice
Switch cuIO benchmarks to use pinned-pool host allocations by default. (#15805) @nvdbaranec
Change thrust::count_if call to raw kernel in strings split APIs (#15762) @davidwendt
Improve performance for long strings for nvtext::replace_tokens (#15756) @davidwendt
Implement chunked parquet reader in cudf-python (#15728) @galipremsagar
Add from_arrow_host functions for cudf interop with nanoarrow (#15645) @zeroshade
Add ability to enable rmm pool on cudf.pandas import (#15628) @galipremsagar
Executor for polars logical plans (#15504) @wence-
Implement day_name and month_name to match pandas (#15479) @btepera
Utilities for decimal <--> floating conversion (#15359) @pmattione-nvidia
For powers of 10, replace ipow with switch (#15353) @pmattione-nvidia
Use rapids-build-backend. (#15245) @vyasr
Add codecov coverage for pandas_tests (#14513) @galipremsagar

- C++
Published by rapids-bot[bot] over 1 year ago

https://github.com/rapidsai/cudf - v24.08.03

🚨 Breaking Changes

Align Index init APIs with pandas 2.x (#16362) @mroeschke
Align Series APIs with pandas 2.x (#16333) @mroeschke
Add missing stream param to dictionary factory APIs (#16319) @JayjeetAtGithub
Deprecate dtype= parameter in reduction methods (#16313) @mroeschke
Remove squeeze argument from groupby (#16312) @mroeschke
Align more DataFrame APIs with pandas (#16310) @mroeschke
Remove mr param from write_csv and write_json (#16231) @JayjeetAtGithub
Report number of rows per file read by PQ reader when no row selection and fix segfault in chunked PQ reader when skip_rows > 0 (#16195) @mhaseeb123
Refactor from_arrow_device/host to use resource_ref (#16160) @harrism
Deprecate Arrow support in I/O (#16132) @lithomas1
Return FrozenList for Index.names (#16047) @galipremsagar
Add compile option to enable large strings support (#16037) @davidwendt
Hide visibility of non public symbols (#15982) @robertmaynard
Rename strings multiple target replace API (#15898) @davidwendt
Pinned vector factory that uses the global pool (#15895) @vuule
Apply clang-tidy autofixes (#15894) @vyasr
Support arrow:schema in Parquet writer to faithfully roundtrip duration types with Arrow (#15875) @mhaseeb123
Expose stream parameter to public rolling APIs (#15865) @srinivasyadav18
Fix large strings handling in nvtext::character_tokenize (#15829) @davidwendt
Remove legacy JSON reader and concurrent_unordered_map.cuh. (#15813) @bdice

🐛 Bug Fixes

Ensure managed memory is supported in cudf.pandas. (#16552) @bdice
Add flatbuffers to libcudf build (#16446) @galipremsagar
Fix parquet_field_list read_func lambda capture invalid this pointer (#16440) @davidwendt
Enable prefetching in cudf.pandas.install() (#16439) @bdice
Enable prefetching before runpy (#16427) @galipremsagar
Support thread-safe for prefetch_config::get and prefetch_config::set (#16425) @ttnghia
Fix a pandas-2.0 missing attribute error (#16416) @galipremsagar
[Bug] Remove loud NativeFile deprecation noise for read_parquet from S3 (#16415) @rjzamora
Fix nightly memcheck error for empty STREAM_INTEROP_TEST (#16406) @davidwendt
Gate ArrowStringArrayNumpySemantics cudf.pandas proxy behind version check (#16401) @mroeschke
Don't export bs_thread_pool (#16398) @KyleFromNVIDIA
Require fixed width types for casting in cudf-polars (#16381) @brandon-b-miller
Fix docstring of DataFrame.apply (#16351) @galipremsagar
Make __bool__ raise for more cudf objects (#16311) @mroeschke
Rename .devcontainers for CUDA 12.5 (#16293) @jakirkham
Fix split_record for all empty strings column (#16291) @davidwendt
Fix logic in to_arrow for empty list column (#16279) @wence-
[BUG] Make name attr of Index fast slow attrs (#16270) @Matt711
Add custom name setter and getter for proxy objects in cudf.pandas (#16234) @Matt711
Fall back when casting a timestamp to numeric in cudf-polars (#16232) @brandon-b-miller
Disable large string support for Java build (#16216) @jlowe
Remove CCCL patch for PR 211. (#16207) @bdice
Add single offset to an empty ListArray in cudf::to_arrow (#16201) @davidwendt
Fix memory_usage when calculating nested list column (#16193) @mroeschke
Support at/iat indexers in cudf.pandas (#16177) @mroeschke
Fix unused-return-value debug build error in from_arrow_stream_test.cpp (#16168) @davidwendt
Fix cudf::strings::replace_multiple hang on empty target (#16167) @davidwendt
Refactor from_arrow_device/host to use resource_ref (#16160) @harrism
interpolate returns new column if no values are interpolated (#16158) @mroeschke
Use provided memory resource for allocating mixed join results. (#16153) @bdice
Run DFG after verify-alpha-spec (#16151) @KyleFromNVIDIA
Use size_t to allow large conditional joins (#16127) @bdice
Allow only scale=0 fixed-point values in fixed_width_column_wrapper (#16120) @davidwendt
Fix pylibcudf Table.num_rows for 0 columns case and add interop to docs (#16108) @lithomas1
Add support for proxy np.flatiter objects (#16107) @Matt711
Ensure cudf objects can astype to any type when empty (#16106) @mroeschke
Support pd.read_pickle and pd.to_pickle in cudf.pandas (#16105) @Matt711
Fix unnecessarily strict check in parquet chunked reader for choosing split locations. (#16099) @nvdbaranec
Fix is_monotonic_* APIs to include nan's (#16085) @galipremsagar
More safely parse CUDA versions when subprocess output is contaminated (#16067) @brandon-b-miller
fast_slow_proxy: Don't import assert_eq at top-level (#16063) @wence-
Prevent bad ColumnAccessor state after .sort_index(axis=1, ignore_index=True) (#16061) @mroeschke
Fix ArrowDeviceArray interface to pass address of event (#16058) @zeroshade
Fix a size overflow bug in hash groupby (#16053) @PointKernel
Fix atomic_ref scope when multiple blocks are updating the same output (#16051) @vuule
Fix initialization error in to_arrow for empty string views (#16033) @wence-
Fix the int32 overflow when computing page fragment sizes for large string columns (#16028) @mhaseeb123
Fix the pool size alignment issue (#16024) @PointKernel
Improve multibyte-split byte-range performance (#16019) @davidwendt
Fix target counting in strings char-parallel replace (#16017) @davidwendt
Support IntervalDtype in cudf.from_pandas (#16014) @mroeschke
Fix memory size in create_byte_range_infos_consecutive (#16012) @davidwendt
Hide visibility of non public symbols (#15982) @robertmaynard
Fix Cython typo preventing proper inheritance (#15978) @vyasr
Fix convert_dtypes with convert_integer=False/convert_floating=True (#15964) @mroeschke
Fix nunique for MultiIndex, DataFrame, and all NA case with dropna=False (#15962) @mroeschke
Explicitly build for all GPU architectures (#15959) @vyasr
Preserve column type and class information in more DataFrame operations (#15949) @mroeschke
Add __array_interface__ to cudf.pandas numpy.ndarray proxy (#15936) @mroeschke
Allow tests to be built when stream util is disabled (#15933) @robertmaynard
Fix JSON multi-source reading when total source size exceeds INT_MAX bytes (#15930) @shrshi
Fix dask_cudf.read_parquet regression for legacy timestamp data (#15929) @rjzamora
Fix offsetalator when accessing over 268 million rows (#15921) @davidwendt
Fix debug assert in rowgroup_char_counts_kernel (#15902) @davidwendt
Fix categorical conversion from chunked arrow arrays (#15886) @vyasr
Handling for NaN and inf when converting floating point to fixed point types (#15885) @ttnghia
Manual merge of Branch 24.08 from 24.06 (#15869) @galipremsagar
Avoid unnecessary Index cast in IndexedFrame.index setter (#15843) @charlesbluca
Fix large strings handling in nvtext::character_tokenize (#15829) @davidwendt
Fix multi-replace target count logic for large strings (#15807) @davidwendt
Fix JSON parsing memory corruption - Fix Mixed types nested children removal (#15798) @karthikeyann
Allow anonymous user in devcontainer name. (#15784) @bdice
Add support for additional metaclasses of proxies and use for ExcelWriter (#15399) @vyasr

📖 Documentation

Add docstring for from_dataframe (#16260) @mroeschke
Update libcudf compiler requirements in contributing doc (#16103) @davidwendt
Add libcudf public/detail API pattern to developer guide (#16086) @davidwendt
Explain line profiler and how to know which functions are GPU-accelerated. (#16079) @bdice
cudf.pandas documentation improvement (#15948) @Matt711
Reland "Fix docs for IO readers and strings_convert" (#15872) (#15941) @lithomas1
Document how to use cudf.pandas in tandem with multiprocessing (#15940) @wence-
DOC: Add documentation for cudf.pandas in the Developer Guide (#15889) @Matt711
Improve options docs (#15888) @bdice
DOC: add linkcode to docs (#15860) @raybellwaves
DOC: use intersphinx mapping in pandas-compat ext (#15846) @raybellwaves
Fix inconsistent usage of 'results' and 'records' in read-json.md (#15766) @dagardner-nv
Update PandasCompat.py to resolve references (#15704) @raybellwaves

🚀 New Features

Creation of CI artifacts for cudf-polars wheels (#16680) @wence-
Warn on cuDF failure when POLARS_VERBOSE is true (#16308) @brandon-b-miller
Add drop_nulls in cudf-polars (#16290) @brandon-b-miller
[JNI] Add setKernelPinnedCopyThreshold and setPinnedAllocationThreshold (#16288) @abellina
Implement support for scan_ndjson in cudf-polars (#16263) @lithomas1
Publish cudf-polars nightlies (#16213) @lithomas1
Modify make_host_vector and make_device_uvector factories to optionally use pinned memory and kernel copy (#16206) @vuule
Migrate lists/set_operations to pylibcudf (#16190) @Matt711
Migrate lists/filling to pylibcudf (#16189) @Matt711
Fall back to CPU for unsupported libcudf binaryops in cudf-polars (#16188) @brandon-b-miller
Use resource_ref for upstream in stream_checking_resource_adaptor (#16187) @harrism
Migrate lists/modifying to pylibcudf (#16185) @Matt711
Migrate lists/filtering to pylibcudf (#16184) @Matt711
Migrate lists/sorting to pylibcudf (#16179) @Matt711
Add missing methods to lists/list_column_view.pxd in pylibcudf (#16175) @Matt711
Migrate pylibcudf lists gathering (#16170) @Matt711
Move kernel vis over to CUDF_HIDDEN (#16165) @robertmaynard
Add groupby_max multi-threaded benchmark (#16154) @srinivasyadav18
Promote has_nested_columns to cudf public API (#16131) @robertmaynard
Promote IO support queries to cudf API (#16125) @robertmaynard
cudf::merge public API now support passing a user stream (#16124) @robertmaynard
Add TPC-H inspired examples for Libcudf (#16088) @JayjeetAtGithub
Installed cudf header use cudf::allocate_like (#16087) @robertmaynard
cudf-polars string slicing (#16082) @brandon-b-miller
Migrate Parquet reader to pylibcudf (#16078) @lithomas1
Migrate lists/count_elements to pylibcudf (#16072) @Matt711
Migrate lists/extract to pylibcudf (#16071) @Matt711
Move common string utilities to public api (#16070) @robertmaynard
stable_distinct public api now has a stream parameter (#16068) @robertmaynard
Migrate expressions to pylibcudf (#16056) @lithomas1
Add support to ArrowDataSource in SourceInfo (#16050) @lithomas1
Experimental support for configurable prefetching (#16020) @vyasr
Migrate CSV reader to pylibcudf (#16011) @lithomas1
Migrate string slice APIs to pylibcudf (#15988) @brandon-b-miller
Migrate lists/contains to pylibcudf (#15981) @Matt711
Remove CCCL 2.2 patches as we now always use 2.5+ (#15969) @robertmaynard
Migrate JSON reader to pylibcudf (#15966) @lithomas1
Add a developer check for proxy objects (#15956) @Matt711
Start migrating I/O writers to pylibcudf (starting with JSON) (#15952) @lithomas1
Kernel copy for pinned memory (#15934) @vuule
Migrate left join and conditional join benchmarks to use nvbench (#15931) @srinivasyadav18
Migrate lists/combine to pylibcudf (#15928) @Matt711
Plumb pylibcudf strings contains_re through cudf_polars (#15918) @brandon-b-miller
Start migrating I/O to pylibcudf (#15899) @lithomas1
Pinned vector factory that uses the global pool (#15895) @vuule
Migrate strings contains operations to pylibcudf (#15880) @brandon-b-miller
Migrate quantile.pxd to pylibcudf (#15874) @lithomas1
Migrate round to pylibcudf (#15863) @lithomas1
Migrate string replace.pxd to pylibcudf (#15839) @lithomas1
Add an Environment Variable for debugging the fast path in cudf.pandas (#15837) @Matt711
Add an option to run cuIO benchmarks with pinned buffers as input (#15830) @vuule
Update pylibcudf testing utilities (#15772) @brandon-b-miller
Migrate string capitalize APIs to pylibcudf (#15503) @brandon-b-miller
Add tests for pylibcudf binaryops (#15470) @brandon-b-miller
Migrate column factories to pylibcudf (#15257) @brandon-b-miller
cuDF/libcudf exponentially weighted moving averages (#9027) @brandon-b-miller

🛠️ Improvements

Ensure objects with __interface__ are converted to cupy/numpy arrays (#16436) @mroeschke
Add about rmm modes in cudf.pandas docs (#16404) @galipremsagar
Gracefully CUDF_FAIL when `skip_rows > 0` in Chunked Parquet reader (#16385) @mhaseeb123
Make C++ compilation warning free after #16297 (#16379) @wence-
Align Index init APIs with pandas 2.x (#16362) @mroeschke
Use rapids_cpm_bs_thread_pool() (#16360) @KyleFromNVIDIA
Rename PrefetchConfig to prefetch_config. (#16358) @bdice
Implement parquet reading using pylibcudf in cudf-polars (#16346) @lithomas1
Fix compile warnings with jni_utils.hpp (#16336) @ttnghia
Align Series APIs with pandas 2.x (#16333) @mroeschke
Add missing stream param to dictionary factory APIs (#16319) @JayjeetAtGithub
Mark cudf._typing as a typing module in ruff (#16318) @mroeschke
Add stream param to list explode APIs (#16317) @JayjeetAtGithub
Fix polars for 1.2.1 (#16316) @lithomas1
Use workflow branch 24.08 again (#16314) @KyleFromNVIDIA
Deprecate dtype= parameter in reduction methods (#16313) @mroeschke
Remove squeeze argument from groupby (#16312) @mroeschke
Align more DataFrame APIs with pandas (#16310) @mroeschke
Clean unneeded/redundant dtype utils (#16309) @mroeschke
Implement read_csv in cudf-polars using pylibcudf (#16307) @lithomas1
Use Column.can_cast_safely instead of some ad-hoc dtype functions in .where (#16303) @mroeschke
Drop {{ pin_compatible('numpy', max_pin='x') }} (#16301) @jakirkham
Host implementation of to_arrow using nanoarrow (#16297) @zeroshade
Add ability to prefetch in cudf.pandas and change default to managed pool (#16296) @galipremsagar
Fix tests for polars 1.2 (#16292) @lithomas1
Introduce dedicated options for low memory readers (#16289) @galipremsagar
Remove decimal/floating 64/128bit switches due to register pressure (#16287) @pmattione-nvidia
Make ColumnAccessor strictly require a mapping of columns (#16285) @mroeschke
Introduce version file so we can conditionally handle things in tests (#16280) @wence-
Type & reduce cupy usage (#16277) @mroeschke
Update cudf::detail::grid_1d to use thread_index_type (#16276) @davidwendt
Replace np.isscalar/issubdtype checks with is_scalar/.kind checks (#16275) @mroeschke
Remove xml from sort_ninja_log.py utility (#16274) @davidwendt
Fix issue in horizontal concat implementation in cudf-polars (#16271) @wence-
Preserve order in left join for cudf-polars (#16268) @wence-
Replace is_datetime/timedelta_dtype checks with .kind checks (#16262) @mroeschke
Replace is_float/integer_dtype checks with .kind checks (#16261) @mroeschke
Build and test with CUDA 12.5.1 (#16259) @KyleFromNVIDIA
Replace is_bool_type with checking .dtype.kind (#16255) @mroeschke
remove cuco_noexcept.diff (#16254) @trxcllnt
Update contains_tests.cpp to use public cudf::slice (#16253) @davidwendt
Improve the test data for pylibcudf I/O tests (#16247) @lithomas1
Short circuit some Column methods (#16246) @mroeschke
Make nvcomp adapter compatible with new version macros (#16245) @vuule
Add Column.strftime/strptime instead of overloading as_string/datetime/timedelta_column (#16243) @mroeschke
Remove temporary functor overloads required by cuco version bump (#16242) @PointKernel
Remove hash_character_ngrams dependency from jaccard_index (#16241) @davidwendt
Expose sorted groupby parameters to pylibcudf (#16240) @wence-
Expose reflection to check if casting between two types is supported (#16239) @wence-
Handle nans in groupby-aggregations in polars executor (#16233) @wence-
Remove mr param from write_csv and write_json (#16231) @JayjeetAtGithub
Support Literals in groupby-agg (#16218) @wence-
Handle csv reader options in cudf-polars (#16211) @wence-
Update vendored thread_pool implementation (#16210) @wence-
Add low memory JSON reader for cudf.pandas (#16204) @galipremsagar
Clean up state variables in MultiIndex (#16203) @mroeschke
skip CMake 3.30.0 (#16202) @jameslamb
Assert valid metadata is passed in to_arrow for list_view (#16198) @wence-
Expose type traits to pylibcudf (#16197) @wence-
Report number of rows per file read by PQ reader when no row selection and fix segfault in chunked PQ reader when skip_rows > 0 (#16195) @mhaseeb123
Cast count aggs to correct dtype in translation (#16192) @wence-
Some small fixes in cudf-polars (#16191) @wence-
split up CUDA-suffixed dependencies in dependencies.yaml (#16183) @jameslamb
Define PTDS for the stream hook libs (#16182) @trxcllnt
Make test_python_cudf_pandas generate requirements.txt (#16181) @trxcllnt
Add environment-agnostic ci/run_cudf_polars_pytest.sh (#16178) @trxcllnt
Implement translation for some unary functions and a single datetime extraction (#16173) @wence-
Remove size constraints on source files in batched JSON reading (#16162) @shrshi
CI: Build wheels for cudf-polars (#16156) @lithomas1
Update cudf-polars for v1 release of polars (#16149) @wence-
Use strings concatenate to support large strings in CSV writer (#16148) @davidwendt
Use verify-alpha-spec hook (#16144) @KyleFromNVIDIA
Adds write-coalescing code path optimization to FST (#16143) @elstehle
MAINT: Adapt to NumPy 2 promotion changes (#16141) @seberg
API: Check for integer overflows when creating scalar from python int (#16140) @seberg
Remove the (unused) implementation of host_parse_nested_json (#16135) @vuule
Deprecate Arrow support in I/O (#16132) @lithomas1
Disable dict support for split-page kernel in the parquet reader. (#16128) @nvdbaranec
Add throughput metrics for REDUCTION_BENCH/REDUCTION_NVBENCH benchmarks (#16126) @jihoonson
Add ensure_index to not unnecessarily shallow copy cudf.Index (#16117) @mroeschke
Make binary operators work between fixed-point and floating args (#16116) @pmattione-nvidia
Implement Ternary copy_if_else (#16114) @wence-
Implement handlers for series literal in cudf-polars (#16113) @wence-
Fix dtype errors in StringArrays (#16111) @galipremsagar
Ensure MultiIndex.to_frame deep copies columns (#16110) @mroeschke
Parallelize gpuInitStringDescriptors for fixed length byte array data (#16109) @mhaseeb123
Finish implementation of cudf-polars boolean function handlers (#16098) @wence-
Expose and then implement support for cross joins in cudf-polars (#16097) @wence-
Defer copying in Column.astype(copy=True) (#16095) @mroeschke
Fix segfault in conditional join (#16094) @bdice
Free temp memory no longer needed in multibyte_split processing (#16091) @davidwendt
Rename gather/scatter benchmarks to clarify coalesced behavior. (#16083) @bdice
Adapt to polars upstream changes and turn on CI testing (#16081) @wence-
Reduce/clean copy usage in Series, reshaping (#16080) @mroeschke
Account for FIXED_LEN_BYTE_ARRAY when calculating fragment sizes in Parquet writer (#16064) @etseidl
Reduce (shallow) copies in DataFrame ops (#16060) @mroeschke
Add multi-file support to dask_cudf.read_json (#16057) @rjzamora
Reduce deep copies in Index ops (#16054) @mroeschke
Implement chunked column wise concat in chunked parquet reader (#16052) @galipremsagar
Add exception when trying to create large strings with cudf::test::strings_column_wrapper (#16049) @davidwendt
Return FrozenList for Index.names (#16047) @galipremsagar
Add ast cast test (#16045) @pmattione-nvidia
Remove override_dtypes and include_index from Frame._copy_type_metadata (#16043) @mroeschke
Add ruff rules to avoid importing from typing (#16040) @mroeschke
Fix decimal -> float cast in ast code (#16038) @pmattione-nvidia
Add compile option to enable large strings support (#16037) @davidwendt
Reduce conditional_join nvbench configurations (#16036) @srinivasyadav18
Project automation update: skip if not in project (#16035) @jarmak-nv
Add stream parameter to cudf::io::text::multibyte_split (#16034) @davidwendt
Delete unused code from stringfunction evaluator (#16032) @wence-
Fix exclude regex in pre-commit clang-format hook (#16030) @wence-
Refactor rmm usage in cudf.pandas (#16021) @galipremsagar
Enable ruff TCH: typing imports under if TYPE_CHECKING (#16015) @mroeschke
Restrict the allowed pandas timezone objects in cudf (#16013) @mroeschke
orc multithreaded benchmark (#16009) @zpuller
Add tests of expression-based sort and sort-by (#16008) @wence-
Add tests of implemented StringFunctions (#16007) @wence-
Add test that diagonal concat with mismatching schemas raises (#16006) @wence-
Add coverage selecting len from a dataframe (number of rows) (#16005) @wence-
Add basic tests of dataframe scan (#16003) @wence-
Add coverage for both expression and dataframe filter (#16002) @wence-
Remove deprecated ExtContext node (#16001) @wence-
Fix typo bug in gather implementation (#16000) @wence-
Extend coverage of groupby and rolling window nodes (#15999) @wence-
Coverage of binops where one or both operands are a scalar (#15998) @wence-
Add full coverage for whole-frame Agg expressions (#15997) @wence-
Add tests covering magic methods of Expr objects (#15996) @wence-
Add full coverage of utility functions (#15995) @wence-
Test behaviour of containers (#15994) @wence-
Fix implementation of any, all, and isbetween (#15993) @wence-
Raise early on unhandled PythonScan node (#15992) @wence-
Remove mapfunction nodes that don't exist/aren't supported (#15991) @wence-
Add test coverage for slicing with "out of bounds" negative indices (#15990) @wence-
Standardize and type Series.dt methods (#15987) @mroeschke
Refactor distinct with hashset-based algorithms (#15984) @srinivasyadav18
resolve dependency-file-generator warning, remove unnecessary rapids-build-backend configuration (#15980) @jameslamb
Project automation bug fixes (#15971) @jarmak-nv
Add typing to single_column_frame (#15965) @mroeschke
Move some misc Frame methods to appropriate locations (#15963) @mroeschke
Condense pylibcudf data fixtures (#15958) @lithomas1
Refactor fillna logic to push specifics toward Frame subclasses and Column subclasses (#15957) @mroeschke
Remove unused parsing utilities (#15955) @vuule
Remove Scalar container type from polars interpreter (#15953) @wence-
Support arbitrary CUDA versions in UDF code (#15950) @bdice
Support large strings in cudf::io::text::multibyte_split (#15947) @davidwendt
Add external issue label and project automation (#15945) @jarmak-nv
Enable round-tripping of large strings in cudf (#15944) @galipremsagar
Add more complete type annotations in polars interpreter (#15942) @wence-
Update implementations to build with the latest cuco (#15938) @PointKernel
Support timezone aware pandas inputs in cudf (#15935) @mroeschke
Define Column.nan_as_null to return self (#15923) @mroeschke
Make Frame._dtype an iterator instead of a dict (#15920) @mroeschke
Port start of datetime.hpp to pylibcudf (#15916) @wence-
Introduce NamedColumn concept in cudf-polars (#15914) @wence-
Avoid redefining Frame._get_columns_by_label in subclasses (#15912) @mroeschke
Templatization of fixed-width parquet decoding kernels. (#15911) @nvdbaranec
New Decimal <--> Floating conversion (#15905) @pmattione-nvidia
Use Arrow C Data Interface functions for Python interop (#15904) @vyasr
Use offsetalator in cudf::io::json::detail::parse_string (#15900) @davidwendt
Rename strings multiple target replace API (#15898) @davidwendt
Apply clang-tidy autofixes (#15894) @vyasr
Update Python labels and remove unnecessary ones (#15893) @vyasr
Clean up pylibcudf test assertions (#15892) @lithomas1
Use offsetalator in orc rowgroup_char_counts_kernel (#15891) @davidwendt
Ensure literals have correct dtype (#15890) @wence-
Add overflow check when converting large strings to lists columns (#15887) @davidwendt
Use offsetalator in nvtext::tokenize_with_vocabulary (#15878) @davidwendt
Update interleave lists column for large strings (#15877) @davidwendt
Simple NumPy 2 fixes that are clearly no behavior change (#15876) @seberg
Support arrow:schema in Parquet writer to faithfully roundtrip duration types with Arrow (#15875) @mhaseeb123
Refactor join benchmarks to target public APIs with the default stream (#15873) @PointKernel
Fix url-decode benchmark to use offsetalator (#15871) @davidwendt
Use offsetalator in strings shift functor (#15870) @davidwendt
Memory Profiling (#15866) @madsbk
Expose stream parameter to public rolling APIs (#15865) @srinivasyadav18
Make Frame.astype return Self instead of a ColumnAccessor (#15861) @mroeschke
Use ColumnAccessor row and column length attributes more consistently (#15857) @mroeschke
add unit test setup for cudf_kafka (#15853) @jameslamb
Remove internal usage of core.index.as_index in favor of cudf.Index (#15851) @mroeschke
Ensure cudf.Series(cudf.Series(...)) creates a reference to the same index (#15845) @mroeschke
Remove benchmark-specific use of pinned-pooled memory in Parquet multithreaded benchmark. (#15838) @nvdbaranec
Implement on_bad_lines in json reader (#15834) @galipremsagar
Make Column.to_pandas return Index instead of Series (#15833) @mroeschke
Add test of interoperability of cuDF and arrow BYTE_STREAM_SPLIT encoders (#15832) @etseidl
Refactor Parquet writer options and builders (#15831) @etseidl
Migrate reshape.pxd to pylibcudf (#15827) @lithomas1
Remove legacy JSON reader and concurrent_unordered_map.cuh. (#15813) @bdice
Switch cuIO benchmarks to use pinned-pool host allocations by default. (#15805) @nvdbaranec
Change thrust::count_if call to raw kernel in strings split APIs (#15762) @davidwendt
Improve performance for long strings for nvtext::replace_tokens (#15756) @davidwendt
Implement chunked parquet reader in cudf-python (#15728) @galipremsagar
Add from_arrow_host functions for cudf interop with nanoarrow (#15645) @zeroshade
Add ability to enable rmm pool on cudf.pandas import (#15628) @galipremsagar
Executor for polars logical plans (#15504) @wence-
Implement dayname and monthname to match pandas (#15479) @btepera
Utilities for decimal <--> floating conversion (#15359) @pmattione-nvidia
For powers of 10, replace ipow with switch (#15353) @pmattione-nvidia
Use rapids-build-backend. (#15245) @vyasr
Add codecov coverage for pandas_tests (#14513) @galipremsagar

Published by raydouglass over 1 year ago

https://github.com/rapidsai/cudf - v24.08.02

🚨 Breaking Changes

Align Index init APIs with pandas 2.x (#16362) @mroeschke
Align Series APIs with pandas 2.x (#16333) @mroeschke
Add missing stream param to dictionary factory APIs (#16319) @JayjeetAtGithub
Deprecate dtype= parameter in reduction methods (#16313) @mroeschke
Remove squeeze argument from groupby (#16312) @mroeschke
Align more DataFrame APIs with pandas (#16310) @mroeschke
Remove mr param from write_csv and write_json (#16231) @JayjeetAtGithub
Report number of rows per file read by PQ reader when no row selection and fix segfault in chunked PQ reader when skip_rows > 0 (#16195) @mhaseeb123
Refactor from_arrow_device/host to use resource_ref (#16160) @harrism
Deprecate Arrow support in I/O (#16132) @lithomas1
Return FrozenList for Index.names (#16047) @galipremsagar
Add compile option to enable large strings support (#16037) @davidwendt
Hide visibility of non public symbols (#15982) @robertmaynard
Rename strings multiple target replace API (#15898) @davidwendt
Pinned vector factory that uses the global pool (#15895) @vuule
Apply clang-tidy autofixes (#15894) @vyasr
Support arrow:schema in Parquet writer to faithfully roundtrip duration types with Arrow (#15875) @mhaseeb123
Expose stream parameter to public rolling APIs (#15865) @srinivasyadav18
Fix large strings handling in nvtext::character_tokenize (#15829) @davidwendt
Remove legacy JSON reader and concurrent_unordered_map.cuh. (#15813) @bdice

🐛 Bug Fixes

Ensure managed memory is supported in cudf.pandas. (#16552) @bdice
Add flatbuffers to libcudf build (#16446) @galipremsagar
Fix parquet_field_list read_func lambda capture invalid this pointer (#16440) @davidwendt
Enable prefetching in cudf.pandas.install() (#16439) @bdice
Enable prefetching before runpy (#16427) @galipremsagar
Support thread-safe for prefetch_config::get and prefetch_config::set (#16425) @ttnghia
Fix a pandas-2.0 missing attribute error (#16416) @galipremsagar
[Bug] Remove loud NativeFile deprecation noise for read_parquet from S3 (#16415) @rjzamora
Fix nightly memcheck error for empty STREAM_INTEROP_TEST (#16406) @davidwendt
Gate ArrowStringArrayNumpySemantics cudf.pandas proxy behind version check (#16401) @mroeschke
Don't export bs_thread_pool (#16398) @KyleFromNVIDIA
Require fixed width types for casting in cudf-polars (#16381) @brandon-b-miller
Fix docstring of DataFrame.apply (#16351) @galipremsagar
Make __bool__ raise for more cudf objects (#16311) @mroeschke
Rename .devcontainers for CUDA 12.5 (#16293) @jakirkham
Fix split_record for all empty strings column (#16291) @davidwendt
Fix logic in to_arrow for empty list column (#16279) @wence-
[BUG] Make name attr of Index fast slow attrs (#16270) @Matt711
Add custom name setter and getter for proxy objects in cudf.pandas (#16234) @Matt711
Fall back when casting a timestamp to numeric in cudf-polars (#16232) @brandon-b-miller
Disable large string support for Java build (#16216) @jlowe
Remove CCCL patch for PR 211. (#16207) @bdice
Add single offset to an empty ListArray in cudf::to_arrow (#16201) @davidwendt
Fix memory_usage when calculating nested list column (#16193) @mroeschke
Support at/iat indexers in cudf.pandas (#16177) @mroeschke
Fix unused-return-value debug build error in from_arrow_stream_test.cpp (#16168) @davidwendt
Fix cudf::strings::replace_multiple hang on empty target (#16167) @davidwendt
Refactor from_arrow_device/host to use resource_ref (#16160) @harrism
interpolate returns new column if no values are interpolated (#16158) @mroeschke
Use provided memory resource for allocating mixed join results. (#16153) @bdice
Run DFG after verify-alpha-spec (#16151) @KyleFromNVIDIA
Use size_t to allow large conditional joins (#16127) @bdice
Allow only scale=0 fixed-point values in fixed_width_column_wrapper (#16120) @davidwendt
Fix pylibcudf Table.num_rows for 0 columns case and add interop to docs (#16108) @lithomas1
Add support for proxy np.flatiter objects (#16107) @Matt711
Ensure cudf objects can astype to any type when empty (#16106) @mroeschke
Support pd.read_pickle and pd.to_pickle in cudf.pandas (#16105) @Matt711
Fix unnecessarily strict check in parquet chunked reader for choosing split locations. (#16099) @nvdbaranec
Fix is_monotonic_* APIs to include nan's (#16085) @galipremsagar
More safely parse CUDA versions when subprocess output is contaminated (#16067) @brandon-b-miller
fast_slow_proxy: Don't import assert_eq at top-level (#16063) @wence-
Prevent bad ColumnAccessor state after .sort_index(axis=1, ignore_index=True) (#16061) @mroeschke
Fix ArrowDeviceArray interface to pass address of event (#16058) @zeroshade
Fix a size overflow bug in hash groupby (#16053) @PointKernel
Fix atomic_ref scope when multiple blocks are updating the same output (#16051) @vuule
Fix initialization error in to_arrow for empty string views (#16033) @wence-
Fix the int32 overflow when computing page fragment sizes for large string columns (#16028) @mhaseeb123
Fix the pool size alignment issue (#16024) @PointKernel
Improve multibyte-split byte-range performance (#16019) @davidwendt
Fix target counting in strings char-parallel replace (#16017) @davidwendt
Support IntervalDtype in cudf.from_pandas (#16014) @mroeschke
Fix memory size in create_byte_range_infos_consecutive (#16012) @davidwendt
Hide visibility of non public symbols (#15982) @robertmaynard
Fix Cython typo preventing proper inheritance (#15978) @vyasr
Fix convert_dtypes with convert_integer=False/convert_floating=True (#15964) @mroeschke
Fix nunique for MultiIndex, DataFrame, and all NA case with dropna=False (#15962) @mroeschke
Explicitly build for all GPU architectures (#15959) @vyasr
Preserve column type and class information in more DataFrame operations (#15949) @mroeschke
Add __array_interface__ to cudf.pandas numpy.ndarray proxy (#15936) @mroeschke
Allow tests to be built when stream util is disabled (#15933) @robertmaynard
Fix JSON multi-source reading when total source size exceeds INT_MAX bytes (#15930) @shrshi
Fix dask_cudf.read_parquet regression for legacy timestamp data (#15929) @rjzamora
Fix offsetalator when accessing over 268 million rows (#15921) @davidwendt
Fix debug assert in rowgroup_char_counts_kernel (#15902) @davidwendt
Fix categorical conversion from chunked arrow arrays (#15886) @vyasr
Handling for NaN and inf when converting floating point to fixed point types (#15885) @ttnghia
Manual merge of Branch 24.08 from 24.06 (#15869) @galipremsagar
Avoid unnecessary Index cast in IndexedFrame.index setter (#15843) @charlesbluca
Fix large strings handling in nvtext::character_tokenize (#15829) @davidwendt
Fix multi-replace target count logic for large strings (#15807) @davidwendt
Fix JSON parsing memory corruption - Fix Mixed types nested children removal (#15798) @karthikeyann
Allow anonymous user in devcontainer name. (#15784) @bdice
Add support for additional metaclasses of proxies and use for ExcelWriter (#15399) @vyasr

📖 Documentation

Add docstring for from_dataframe (#16260) @mroeschke
Update libcudf compiler requirements in contributing doc (#16103) @davidwendt
Add libcudf public/detail API pattern to developer guide (#16086) @davidwendt
Explain line profiler and how to know which functions are GPU-accelerated. (#16079) @bdice
cudf.pandas documentation improvement (#15948) @Matt711
Reland "Fix docs for IO readers and strings_convert (#15872)" (#15941) @lithomas1
Document how to use cudf.pandas in tandem with multiprocessing (#15940) @wence-
DOC: Add documentation for cudf.pandas in the Developer Guide (#15889) @Matt711
Improve options docs (#15888) @bdice
DOC: add linkcode to docs (#15860) @raybellwaves
DOC: use intersphinx mapping in pandas-compat ext (#15846) @raybellwaves
Fix inconsistent usage of 'results' and 'records' in read-json.md (#15766) @dagardner-nv
Update PandasCompat.py to resolve references (#15704) @raybellwaves

🚀 New Features

Warn on cuDF failure when POLARS_VERBOSE is true (#16308) @brandon-b-miller
Add drop_nulls in cudf-polars (#16290) @brandon-b-miller
[JNI] Add setKernelPinnedCopyThreshold and setPinnedAllocationThreshold (#16288) @abellina
Implement support for scan_ndjson in cudf-polars (#16263) @lithomas1
Publish cudf-polars nightlies (#16213) @lithomas1
Modify make_host_vector and make_device_uvector factories to optionally use pinned memory and kernel copy (#16206) @vuule
Migrate lists/set_operations to pylibcudf (#16190) @Matt711
Migrate lists/filling to pylibcudf (#16189) @Matt711
Fall back to CPU for unsupported libcudf binaryops in cudf-polars (#16188) @brandon-b-miller
Use resource_ref for upstream in stream_checking_resource_adaptor (#16187) @harrism
Migrate lists/modifying to pylibcudf (#16185) @Matt711
Migrate lists/filtering to pylibcudf (#16184) @Matt711
Migrate lists/sorting to pylibcudf (#16179) @Matt711
Add missing methods to lists/list_column_view.pxd in pylibcudf (#16175) @Matt711
Migrate pylibcudf lists gathering (#16170) @Matt711
Move kernel vis over to CUDF_HIDDEN (#16165) @robertmaynard
Add groupby_max multi-threaded benchmark (#16154) @srinivasyadav18
Promote has_nested_columns to cudf public API (#16131) @robertmaynard
Promote IO support queries to cudf API (#16125) @robertmaynard
cudf::merge public API now supports passing a user stream (#16124) @robertmaynard
Add TPC-H inspired examples for Libcudf (#16088) @JayjeetAtGithub
Installed cudf header use cudf::allocate_like (#16087) @robertmaynard
cudf-polars string slicing (#16082) @brandon-b-miller
Migrate Parquet reader to pylibcudf (#16078) @lithomas1
Migrate lists/count_elements to pylibcudf (#16072) @Matt711
Migrate lists/extract to pylibcudf (#16071) @Matt711
Move common string utilities to public api (#16070) @robertmaynard
stable_distinct public api now has a stream parameter (#16068) @robertmaynard
Migrate expressions to pylibcudf (#16056) @lithomas1
Add support to ArrowDataSource in SourceInfo (#16050) @lithomas1
Experimental support for configurable prefetching (#16020) @vyasr
Migrate CSV reader to pylibcudf (#16011) @lithomas1
Migrate string slice APIs to pylibcudf (#15988) @brandon-b-miller
Migrate lists/contains to pylibcudf (#15981) @Matt711
Remove CCCL 2.2 patches as we now always use 2.5+ (#15969) @robertmaynard
Migrate JSON reader to pylibcudf (#15966) @lithomas1
Add a developer check for proxy objects (#15956) @Matt711
Start migrating I/O writers to pylibcudf (starting with JSON) (#15952) @lithomas1
Kernel copy for pinned memory (#15934) @vuule
Migrate left join and conditional join benchmarks to use nvbench (#15931) @srinivasyadav18
Migrate lists/combine to pylibcudf (#15928) @Matt711
Plumb pylibcudf strings contains_re through cudf_polars (#15918) @brandon-b-miller
Start migrating I/O to pylibcudf (#15899) @lithomas1
Pinned vector factory that uses the global pool (#15895) @vuule
Migrate strings contains operations to pylibcudf (#15880) @brandon-b-miller
Migrate quantile.pxd to pylibcudf (#15874) @lithomas1
Migrate round to pylibcudf (#15863) @lithomas1
Migrate string replace.pxd to pylibcudf (#15839) @lithomas1
Add an Environment Variable for debugging the fast path in cudf.pandas (#15837) @Matt711
Add an option to run cuIO benchmarks with pinned buffers as input (#15830) @vuule
Update pylibcudf testing utilities (#15772) @brandon-b-miller
Migrate string capitalize APIs to pylibcudf (#15503) @brandon-b-miller
Add tests for pylibcudf binaryops (#15470) @brandon-b-miller
Migrate column factories to pylibcudf (#15257) @brandon-b-miller
cuDF/libcudf exponentially weighted moving averages (#9027) @brandon-b-miller

🛠️ Improvements

Ensure objects with __interface__ are converted to cupy/numpy arrays (#16436) @mroeschke
Add about rmm modes in cudf.pandas docs (#16404) @galipremsagar
Gracefully CUDF_FAIL when `skip_rows > 0` in Chunked Parquet reader (#16385) @mhaseeb123
Make C++ compilation warning free after #16297 (#16379) @wence-
Align Index init APIs with pandas 2.x (#16362) @mroeschke
Use rapids_cpm_bs_thread_pool() (#16360) @KyleFromNVIDIA
Rename PrefetchConfig to prefetch_config. (#16358) @bdice
Implement parquet reading using pylibcudf in cudf-polars (#16346) @lithomas1
Fix compile warnings with jni_utils.hpp (#16336) @ttnghia
Align Series APIs with pandas 2.x (#16333) @mroeschke
Add missing stream param to dictionary factory APIs (#16319) @JayjeetAtGithub
Mark cudf._typing as a typing module in ruff (#16318) @mroeschke
Add stream param to list explode APIs (#16317) @JayjeetAtGithub
Fix polars for 1.2.1 (#16316) @lithomas1
Use workflow branch 24.08 again (#16314) @KyleFromNVIDIA
Deprecate dtype= parameter in reduction methods (#16313) @mroeschke
Remove squeeze argument from groupby (#16312) @mroeschke
Align more DataFrame APIs with pandas (#16310) @mroeschke
Clean unneeded/redundant dtype utils (#16309) @mroeschke
Implement read_csv in cudf-polars using pylibcudf (#16307) @lithomas1
Use Column.can_cast_safely instead of some ad-hoc dtype functions in .where (#16303) @mroeschke
Drop {{ pin_compatible('numpy', max_pin='x') }} (#16301) @jakirkham
Host implementation of to_arrow using nanoarrow (#16297) @zeroshade
Add ability to prefetch in cudf.pandas and change default to managed pool (#16296) @galipremsagar
Fix tests for polars 1.2 (#16292) @lithomas1
Introduce dedicated options for low memory readers (#16289) @galipremsagar
Remove decimal/floating 64/128bit switches due to register pressure (#16287) @pmattione-nvidia
Make ColumnAccessor strictly require a mapping of columns (#16285) @mroeschke
Introduce version file so we can conditionally handle things in tests (#16280) @wence-
Type & reduce cupy usage (#16277) @mroeschke
Update cudf::detail::grid_1d to use thread_index_type (#16276) @davidwendt
Replace np.isscalar/issubdtype checks with is_scalar/.kind checks (#16275) @mroeschke
Remove xml from sort_ninja_log.py utility (#16274) @davidwendt
Fix issue in horizontal concat implementation in cudf-polars (#16271) @wence-
Preserve order in left join for cudf-polars (#16268) @wence-
Replace is_datetime/timedelta_dtype checks with .kind checks (#16262) @mroeschke
Replace is_float/integer_dtype checks with .kind checks (#16261) @mroeschke
Build and test with CUDA 12.5.1 (#16259) @KyleFromNVIDIA
Replace is_bool_type with checking .dtype.kind (#16255) @mroeschke
remove cuco_noexcept.diff (#16254) @trxcllnt
Update contains_tests.cpp to use public cudf::slice (#16253) @davidwendt
Improve the test data for pylibcudf I/O tests (#16247) @lithomas1
Short circuit some Column methods (#16246) @mroeschke
Make nvcomp adapter compatible with new version macros (#16245) @vuule
Add Column.strftime/strptime instead of overloading as_string/datetime/timedelta_column (#16243) @mroeschke
Remove temporary functor overloads required by cuco version bump (#16242) @PointKernel
Remove hash_character_ngrams dependency from jaccard_index (#16241) @davidwendt
Expose sorted groupby parameters to pylibcudf (#16240) @wence-
Expose reflection to check if casting between two types is supported (#16239) @wence-
Handle nans in groupby-aggregations in polars executor (#16233) @wence-
Remove mr param from write_csv and write_json (#16231) @JayjeetAtGithub
Support Literals in groupby-agg (#16218) @wence-
Handle csv reader options in cudf-polars (#16211) @wence-
Update vendored thread_pool implementation (#16210) @wence-
Add low memory JSON reader for cudf.pandas (#16204) @galipremsagar
Clean up state variables in MultiIndex (#16203) @mroeschke
skip CMake 3.30.0 (#16202) @jameslamb
Assert valid metadata is passed in to_arrow for list_view (#16198) @wence-
Expose type traits to pylibcudf (#16197) @wence-
Report number of rows per file read by PQ reader when no row selection and fix segfault in chunked PQ reader when skip_rows > 0 (#16195) @mhaseeb123
Cast count aggs to correct dtype in translation (#16192) @wence-
Some small fixes in cudf-polars (#16191) @wence-
split up CUDA-suffixed dependencies in dependencies.yaml (#16183) @jameslamb
Define PTDS for the stream hook libs (#16182) @trxcllnt
Make test_python_cudf_pandas generate requirements.txt (#16181) @trxcllnt
Add environment-agnostic ci/run_cudf_polars_pytest.sh (#16178) @trxcllnt
Implement translation for some unary functions and a single datetime extraction (#16173) @wence-
Remove size constraints on source files in batched JSON reading (#16162) @shrshi
CI: Build wheels for cudf-polars (#16156) @lithomas1
Update cudf-polars for v1 release of polars (#16149) @wence-
Use strings concatenate to support large strings in CSV writer (#16148) @davidwendt
Use verify-alpha-spec hook (#16144) @KyleFromNVIDIA
Adds write-coalescing code path optimization to FST (#16143) @elstehle
MAINT: Adapt to NumPy 2 promotion changes (#16141) @seberg
API: Check for integer overflows when creating scalar from python int (#16140) @seberg
Remove the (unused) implementation of host_parse_nested_json (#16135) @vuule
Deprecate Arrow support in I/O (#16132) @lithomas1
Disable dict support for split-page kernel in the parquet reader. (#16128) @nvdbaranec
Add throughput metrics for REDUCTION_BENCH/REDUCTION_NVBENCH benchmarks (#16126) @jihoonson
Add ensure_index to not unnecessarily shallow copy cudf.Index (#16117) @mroeschke
Make binary operators work between fixed-point and floating args (#16116) @pmattione-nvidia
Implement Ternary copy_if_else (#16114) @wence-
Implement handlers for series literal in cudf-polars (#16113) @wence-
Fix dtype errors in StringArrays (#16111) @galipremsagar
Ensure MultiIndex.to_frame deep copies columns (#16110) @mroeschke
Parallelize gpuInitStringDescriptors for fixed length byte array data (#16109) @mhaseeb123
Finish implementation of cudf-polars boolean function handlers (#16098) @wence-
Expose and then implement support for cross joins in cudf-polars (#16097) @wence-
Defer copying in Column.astype(copy=True) (#16095) @mroeschke
Fix segfault in conditional join (#16094) @bdice
Free temp memory no longer needed in multibyte_split processing (#16091) @davidwendt
Rename gather/scatter benchmarks to clarify coalesced behavior. (#16083) @bdice
Adapt to polars upstream changes and turn on CI testing (#16081) @wence-
Reduce/clean copy usage in Series, reshaping (#16080) @mroeschke
Account for FIXED_LEN_BYTE_ARRAY when calculating fragment sizes in Parquet writer (#16064) @etseidl
Reduce (shallow) copies in DataFrame ops (#16060) @mroeschke
Add multi-file support to dask_cudf.read_json (#16057) @rjzamora
Reduce deep copies in Index ops (#16054) @mroeschke
Implement chunked column wise concat in chunked parquet reader (#16052) @galipremsagar
Add exception when trying to create large strings with cudf::test::strings_column_wrapper (#16049) @davidwendt
Return FrozenList for Index.names (#16047) @galipremsagar
Add ast cast test (#16045) @pmattione-nvidia
Remove override_dtypes and include_index from Frame._copy_type_metadata (#16043) @mroeschke
Add ruff rules to avoid importing from typing (#16040) @mroeschke
Fix decimal -> float cast in ast code (#16038) @pmattione-nvidia
Add compile option to enable large strings support (#16037) @davidwendt
Reduce conditional_join nvbench configurations (#16036) @srinivasyadav18
Project automation update: skip if not in project (#16035) @jarmak-nv
Add stream parameter to cudf::io::text::multibyte_split (#16034) @davidwendt
Delete unused code from stringfunction evaluator (#16032) @wence-
Fix exclude regex in pre-commit clang-format hook (#16030) @wence-
Refactor rmm usage in cudf.pandas (#16021) @galipremsagar
Enable ruff TCH: typing imports under if TYPE_CHECKING (#16015) @mroeschke
Restrict the allowed pandas timezone objects in cudf (#16013) @mroeschke
orc multithreaded benchmark (#16009) @zpuller
Add tests of expression-based sort and sort-by (#16008) @wence-
Add tests of implemented StringFunctions (#16007) @wence-
Add test that diagonal concat with mismatching schemas raises (#16006) @wence-
Add coverage selecting len from a dataframe (number of rows) (#16005) @wence-
Add basic tests of dataframe scan (#16003) @wence-
Add coverage for both expression and dataframe filter (#16002) @wence-
Remove deprecated ExtContext node (#16001) @wence-
Fix typo bug in gather implementation (#16000) @wence-
Extend coverage of groupby and rolling window nodes (#15999) @wence-
Coverage of binops where one or both operands are a scalar (#15998) @wence-
Add full coverage for whole-frame Agg expressions (#15997) @wence-
Add tests covering magic methods of Expr objects (#15996) @wence-
Add full coverage of utility functions (#15995) @wence-
Test behaviour of containers (#15994) @wence-
Fix implementation of any, all, and isbetween (#15993) @wence-
Raise early on unhandled PythonScan node (#15992) @wence-
Remove mapfunction nodes that don't exist/aren't supported (#15991) @wence-
Add test coverage for slicing with "out of bounds" negative indices (#15990) @wence-
Standardize and type Series.dt methods (#15987) @mroeschke
Refactor distinct with hashset-based algorithms (#15984) @srinivasyadav18
resolve dependency-file-generator warning, remove unnecessary rapids-build-backend configuration (#15980) @jameslamb
Project automation bug fixes (#15971) @jarmak-nv
Add typing to single_column_frame (#15965) @mroeschke
Move some misc Frame methods to appropriate locations (#15963) @mroeschke
Condense pylibcudf data fixtures (#15958) @lithomas1
Refactor fillna logic to push specifics toward Frame subclasses and Column subclasses (#15957) @mroeschke
Remove unused parsing utilities (#15955) @vuule
Remove Scalar container type from polars interpreter (#15953) @wence-
Support arbitrary CUDA versions in UDF code (#15950) @bdice
Support large strings in cudf::io::text::multibyte_split (#15947) @davidwendt
Add external issue label and project automation (#15945) @jarmak-nv
Enable round-tripping of large strings in cudf (#15944) @galipremsagar
Add more complete type annotations in polars interpreter (#15942) @wence-
Update implementations to build with the latest cuco (#15938) @PointKernel
Support timezone aware pandas inputs in cudf (#15935) @mroeschke
Define Column.nan_as_null to return self (#15923) @mroeschke
Make Frame._dtype an iterator instead of a dict (#15920) @mroeschke
Port start of datetime.hpp to pylibcudf (#15916) @wence-
Introduce NamedColumn concept in cudf-polars (#15914) @wence-
Avoid redefining Frame._get_columns_by_label in subclasses (#15912) @mroeschke
Templatization of fixed-width parquet decoding kernels. (#15911) @nvdbaranec
New Decimal <--> Floating conversion (#15905) @pmattione-nvidia
Use Arrow C Data Interface functions for Python interop (#15904) @vyasr
Use offsetalator in cudf::io::json::detail::parse_string (#15900) @davidwendt
Rename strings multiple target replace API (#15898) @davidwendt
Apply clang-tidy autofixes (#15894) @vyasr
Update Python labels and remove unnecessary ones (#15893) @vyasr
Clean up pylibcudf test assertions (#15892) @lithomas1
Use offsetalator in orc rowgroup_char_counts_kernel (#15891) @davidwendt
Ensure literals have correct dtype (#15890) @wence-
Add overflow check when converting large strings to lists columns (#15887) @davidwendt
Use offsetalator in nvtext::tokenize_with_vocabulary (#15878) @davidwendt
Update interleave lists column for large strings (#15877) @davidwendt
Simple NumPy 2 fixes that are clearly no behavior change (#15876) @seberg
Support arrow:schema in Parquet writer to faithfully roundtrip duration types with Arrow (#15875) @mhaseeb123
Refactor join benchmarks to target public APIs with the default stream (#15873) @PointKernel
Fix url-decode benchmark to use offsetalator (#15871) @davidwendt
Use offsetalator in strings shift functor (#15870) @davidwendt
Memory Profiling (#15866) @madsbk
Expose stream parameter to public rolling APIs (#15865) @srinivasyadav18
Make Frame.astype return Self instead of a ColumnAccessor (#15861) @mroeschke
Use ColumnAccessor row and column length attributes more consistently (#15857) @mroeschke
add unit test setup for cudf_kafka (#15853) @jameslamb
Remove internal usage of core.index.as_index in favor of cudf.Index (#15851) @mroeschke
Ensure cudf.Series(cudf.Series(...)) creates a reference to the same index (#15845) @mroeschke
Remove benchmark-specific use of pinned-pooled memory in Parquet multithreaded benchmark. (#15838) @nvdbaranec
Implement on_bad_lines in json reader (#15834) @galipremsagar
Make Column.to_pandas return Index instead of Series (#15833) @mroeschke
Add test of interoperability of cuDF and arrow BYTE_STREAM_SPLIT encoders (#15832) @etseidl
Refactor Parquet writer options and builders (#15831) @etseidl
Migrate reshape.pxd to pylibcudf (#15827) @lithomas1
Remove legacy JSON reader and concurrent_unordered_map.cuh. (#15813) @bdice
Switch cuIO benchmarks to use pinned-pool host allocations by default. (#15805) @nvdbaranec
Change thrust::count_if call to raw kernel in strings split APIs (#15762) @davidwendt
Improve performance for long strings for nvtext::replace_tokens (#15756) @davidwendt
Implement chunked parquet reader in cudf-python (#15728) @galipremsagar
Add from_arrow_host functions for cudf interop with nanoarrow (#15645) @zeroshade
Add ability to enable rmm pool on cudf.pandas import (#15628) @galipremsagar
Executor for polars logical plans (#15504) @wence-
Implement dayname and monthname to match pandas (#15479) @btepera
Utilities for decimal <--> floating conversion (#15359) @pmattione-nvidia
For powers of 10, replace ipow with switch (#15353) @pmattione-nvidia
Use rapids-build-backend. (#15245) @vyasr
Add codecov coverage for pandas_tests (#14513) @galipremsagar

Published by raydouglass almost 2 years ago

https://github.com/rapidsai/cudf - v24.08.00

🚨 Breaking Changes

Align Index init APIs with pandas 2.x (#16362) @mroeschke
Align Series APIs with pandas 2.x (#16333) @mroeschke
Add missing stream param to dictionary factory APIs (#16319) @JayjeetAtGithub
Deprecate dtype= parameter in reduction methods (#16313) @mroeschke
Remove squeeze argument from groupby (#16312) @mroeschke
Align more DataFrame APIs with pandas (#16310) @mroeschke
Remove mr param from write_csv and write_json (#16231) @JayjeetAtGithub
Report number of rows per file read by PQ reader when no row selection and fix segfault in chunked PQ reader when skip_rows > 0 (#16195) @mhaseeb123
Refactor from_arrow_device/host to use resource_ref (#16160) @harrism
Deprecate Arrow support in I/O (#16132) @lithomas1
Return FrozenList for Index.names (#16047) @galipremsagar
Add compile option to enable large strings support (#16037) @davidwendt
Hide visibility of non public symbols (#15982) @robertmaynard
Rename strings multiple target replace API (#15898) @davidwendt
Pinned vector factory that uses the global pool (#15895) @vuule
Apply clang-tidy autofixes (#15894) @vyasr
Support arrow:schema in Parquet writer to faithfully roundtrip duration types with Arrow (#15875) @mhaseeb123
Expose stream parameter to public rolling APIs (#15865) @srinivasyadav18
Fix large strings handling in nvtext::character_tokenize (#15829) @davidwendt
Remove legacy JSON reader and concurrent_unordered_map.cuh. (#15813) @bdice

🐛 Bug Fixes

Add flatbuffers to libcudf build (#16446) @galipremsagar
Fix parquet_field_list read_func lambda capture invalid this pointer (#16440) @davidwendt
Enable prefetching in cudf.pandas.install() (#16439) @bdice
Enable prefetching before runpy (#16427) @galipremsagar
Support thread-safe for prefetch_config::get and prefetch_config::set (#16425) @ttnghia
Fix a pandas-2.0 missing attribute error (#16416) @galipremsagar
[Bug] Remove loud NativeFile deprecation noise for read_parquet from S3 (#16415) @rjzamora
Fix nightly memcheck error for empty STREAM_INTEROP_TEST (#16406) @davidwendt
Gate ArrowStringArrayNumpySemantics cudf.pandas proxy behind version check (#16401) @mroeschke
Don't export bs_thread_pool (#16398) @KyleFromNVIDIA
Require fixed width types for casting in cudf-polars (#16381) @brandon-b-miller
Fix docstring of DataFrame.apply (#16351) @galipremsagar
Make __bool__ raise for more cudf objects (#16311) @mroeschke
Rename .devcontainers for CUDA 12.5 (#16293) @jakirkham
Fix split_record for all empty strings column (#16291) @davidwendt
Fix logic in to_arrow for empty list column (#16279) @wence-
[BUG] Make name attr of Index fast slow attrs (#16270) @Matt711
Add custom name setter and getter for proxy objects in cudf.pandas (#16234) @Matt711
Fall back when casting a timestamp to numeric in cudf-polars (#16232) @brandon-b-miller
Disable large string support for Java build (#16216) @jlowe
Remove CCCL patch for PR 211. (#16207) @bdice
Add single offset to an empty ListArray in cudf::to_arrow (#16201) @davidwendt
Fix memory_usage when calculating nested list column (#16193) @mroeschke
Support at/iat indexers in cudf.pandas (#16177) @mroeschke
Fix unused-return-value debug build error in from_arrow_stream_test.cpp (#16168) @davidwendt
Fix cudf::strings::replace_multiple hang on empty target (#16167) @davidwendt
Refactor from_arrow_device/host to use resource_ref (#16160) @harrism
interpolate returns new column if no values are interpolated (#16158) @mroeschke
Use provided memory resource for allocating mixed join results. (#16153) @bdice
Run DFG after verify-alpha-spec (#16151) @KyleFromNVIDIA
Use size_t to allow large conditional joins (#16127) @bdice
Allow only scale=0 fixed-point values in fixed_width_column_wrapper (#16120) @davidwendt
Fix pylibcudf Table.num_rows for 0 columns case and add interop to docs (#16108) @lithomas1
Add support for proxy np.flatiter objects (#16107) @Matt711
Ensure cudf objects can astype to any type when empty (#16106) @mroeschke
Support pd.read_pickle and pd.to_pickle in cudf.pandas (#16105) @Matt711
Fix unnecessarily strict check in parquet chunked reader for choosing split locations. (#16099) @nvdbaranec
Fix is_monotonic_* APIs to include nan's (#16085) @galipremsagar
More safely parse CUDA versions when subprocess output is contaminated (#16067) @brandon-b-miller
fast_slow_proxy: Don't import assert_eq at top-level (#16063) @wence-
Prevent bad ColumnAccessor state after .sort_index(axis=1, ignore_index=True) (#16061) @mroeschke
Fix ArrowDeviceArray interface to pass address of event (#16058) @zeroshade
Fix a size overflow bug in hash groupby (#16053) @PointKernel
Fix atomic_ref scope when multiple blocks are updating the same output (#16051) @vuule
Fix initialization error in to_arrow for empty string views (#16033) @wence-
Fix the int32 overflow when computing page fragment sizes for large string columns (#16028) @mhaseeb123
Fix the pool size alignment issue (#16024) @PointKernel
Improve multibyte-split byte-range performance (#16019) @davidwendt
Fix target counting in strings char-parallel replace (#16017) @davidwendt
Support IntervalDtype in cudf.from_pandas (#16014) @mroeschke
Fix memory size in create_byte_range_infos_consecutive (#16012) @davidwendt
Hide visibility of non public symbols (#15982) @robertmaynard
Fix Cython typo preventing proper inheritance (#15978) @vyasr
Fix convert_dtypes with convert_integer=False/convert_floating=True (#15964) @mroeschke
Fix nunique for MultiIndex, DataFrame, and all NA case with dropna=False (#15962) @mroeschke
Explicitly build for all GPU architectures (#15959) @vyasr
Preserve column type and class information in more DataFrame operations (#15949) @mroeschke
Add __array_interface__ to cudf.pandas numpy.ndarray proxy (#15936) @mroeschke
Allow tests to be built when stream util is disabled (#15933) @robertmaynard
Fix JSON multi-source reading when total source size exceeds INT_MAX bytes (#15930) @shrshi
Fix dask_cudf.read_parquet regression for legacy timestamp data (#15929) @rjzamora
Fix offsetalator when accessing over 268 million rows (#15921) @davidwendt
Fix debug assert in rowgroup_char_counts_kernel (#15902) @davidwendt
Fix categorical conversion from chunked arrow arrays (#15886) @vyasr
Handling for NaN and inf when converting floating point to fixed point types (#15885) @ttnghia
Manual merge of Branch 24.08 from 24.06 (#15869) @galipremsagar
Avoid unnecessary Index cast in IndexedFrame.index setter (#15843) @charlesbluca
Fix large strings handling in nvtext::character_tokenize (#15829) @davidwendt
Fix multi-replace target count logic for large strings (#15807) @davidwendt
Fix JSON parsing memory corruption - Fix Mixed types nested children removal (#15798) @karthikeyann
Allow anonymous user in devcontainer name. (#15784) @bdice
Add support for additional metaclasses of proxies and use for ExcelWriter (#15399) @vyasr

📖 Documentation

Add docstring for from_dataframe (#16260) @mroeschke
Update libcudf compiler requirements in contributing doc (#16103) @davidwendt
Add libcudf public/detail API pattern to developer guide (#16086) @davidwendt
Explain line profiler and how to know which functions are GPU-accelerated. (#16079) @bdice
cudf.pandas documentation improvement (#15948) @Matt711
Reland "Fix docs for IO readers and strings_convert" (#15872) (#15941) @lithomas1
Document how to use cudf.pandas in tandem with multiprocessing (#15940) @wence-
DOC: Add documentation for cudf.pandas in the Developer Guide (#15889) @Matt711
Improve options docs (#15888) @bdice
DOC: add linkcode to docs (#15860) @raybellwaves
DOC: use intersphinx mapping in pandas-compat ext (#15846) @raybellwaves
Fix inconsistent usage of 'results' and 'records' in read-json.md (#15766) @dagardner-nv
Update PandasCompat.py to resolve references (#15704) @raybellwaves

🚀 New Features

Warn on cuDF failure when POLARS_VERBOSE is true (#16308) @brandon-b-miller
Add drop_nulls in cudf-polars (#16290) @brandon-b-miller
[JNI] Add setKernelPinnedCopyThreshold and setPinnedAllocationThreshold (#16288) @abellina
Implement support for scan_ndjson in cudf-polars (#16263) @lithomas1
Publish cudf-polars nightlies (#16213) @lithomas1
Modify make_host_vector and make_device_uvector factories to optionally use pinned memory and kernel copy (#16206) @vuule
Migrate lists/set_operations to pylibcudf (#16190) @Matt711
Migrate lists/filling to pylibcudf (#16189) @Matt711
Fall back to CPU for unsupported libcudf binaryops in cudf-polars (#16188) @brandon-b-miller
Use resource_ref for upstream in stream_checking_resource_adaptor (#16187) @harrism
Migrate lists/modifying to pylibcudf (#16185) @Matt711
Migrate lists/filtering to pylibcudf (#16184) @Matt711
Migrate lists/sorting to pylibcudf (#16179) @Matt711
Add missing methods to lists/list_column_view.pxd in pylibcudf (#16175) @Matt711
Migrate pylibcudf lists gathering (#16170) @Matt711
Move kernel vis over to CUDF_HIDDEN (#16165) @robertmaynard
Add groupby_max multi-threaded benchmark (#16154) @srinivasyadav18
Promote has_nested_columns to cudf public API (#16131) @robertmaynard
Promote IO support queries to cudf API (#16125) @robertmaynard
cudf::merge public API now supports passing a user stream (#16124) @robertmaynard
Add TPC-H inspired examples for Libcudf (#16088) @JayjeetAtGithub
Installed cudf headers use cudf::allocate_like (#16087) @robertmaynard
cudf-polars string slicing (#16082) @brandon-b-miller
Migrate Parquet reader to pylibcudf (#16078) @lithomas1
Migrate lists/count_elements to pylibcudf (#16072) @Matt711
Migrate lists/extract to pylibcudf (#16071) @Matt711
Move common string utilities to public api (#16070) @robertmaynard
stable_distinct public api now has a stream parameter (#16068) @robertmaynard
Migrate expressions to pylibcudf (#16056) @lithomas1
Add support to ArrowDataSource in SourceInfo (#16050) @lithomas1
Experimental support for configurable prefetching (#16020) @vyasr
Migrate CSV reader to pylibcudf (#16011) @lithomas1
Migrate string slice APIs to pylibcudf (#15988) @brandon-b-miller
Migrate lists/contains to pylibcudf (#15981) @Matt711
Remove CCCL 2.2 patches as we now always use 2.5+ (#15969) @robertmaynard
Migrate JSON reader to pylibcudf (#15966) @lithomas1
Add a developer check for proxy objects (#15956) @Matt711
Start migrating I/O writers to pylibcudf (starting with JSON) (#15952) @lithomas1
Kernel copy for pinned memory (#15934) @vuule
Migrate left join and conditional join benchmarks to use nvbench (#15931) @srinivasyadav18
Migrate lists/combine to pylibcudf (#15928) @Matt711
Plumb pylibcudf strings contains_re through cudf_polars (#15918) @brandon-b-miller
Start migrating I/O to pylibcudf (#15899) @lithomas1
Pinned vector factory that uses the global pool (#15895) @vuule
Migrate strings contains operations to pylibcudf (#15880) @brandon-b-miller
Migrate quantile.pxd to pylibcudf (#15874) @lithomas1
Migrate round to pylibcudf (#15863) @lithomas1
Migrate string replace.pxd to pylibcudf (#15839) @lithomas1
Add an Environment Variable for debugging the fast path in cudf.pandas (#15837) @Matt711
Add an option to run cuIO benchmarks with pinned buffers as input (#15830) @vuule
Update pylibcudf testing utilities (#15772) @brandon-b-miller
Migrate string capitalize APIs to pylibcudf (#15503) @brandon-b-miller
Add tests for pylibcudf binaryops (#15470) @brandon-b-miller
Migrate column factories to pylibcudf (#15257) @brandon-b-miller
cuDF/libcudf exponentially weighted moving averages (#9027) @brandon-b-miller

🛠️ Improvements

Ensure objects with __interface__ are converted to cupy/numpy arrays (#16436) @mroeschke
Add about rmm modes in cudf.pandas docs (#16404) @galipremsagar
Gracefully CUDF_FAIL when `skip_rows > 0` in Chunked Parquet reader (#16385) @mhaseeb123
Make C++ compilation warning free after #16297 (#16379) @wence-
Align Index init APIs with pandas 2.x (#16362) @mroeschke
Use rapids_cpm_bs_thread_pool() (#16360) @KyleFromNVIDIA
Rename PrefetchConfig to prefetch_config. (#16358) @bdice
Implement parquet reading using pylibcudf in cudf-polars (#16346) @lithomas1
Fix compile warnings with jni_utils.hpp (#16336) @ttnghia
Align Series APIs with pandas 2.x (#16333) @mroeschke
Add missing stream param to dictionary factory APIs (#16319) @JayjeetAtGithub
Mark cudf._typing as a typing module in ruff (#16318) @mroeschke
Add stream param to list explode APIs (#16317) @JayjeetAtGithub
Fix polars for 1.2.1 (#16316) @lithomas1
Use workflow branch 24.08 again (#16314) @KyleFromNVIDIA
Deprecate dtype= parameter in reduction methods (#16313) @mroeschke
Remove squeeze argument from groupby (#16312) @mroeschke
Align more DataFrame APIs with pandas (#16310) @mroeschke
Clean unneeded/redundant dtype utils (#16309) @mroeschke
Implement read_csv in cudf-polars using pylibcudf (#16307) @lithomas1
Use Column.can_cast_safely instead of some ad-hoc dtype functions in .where (#16303) @mroeschke
Drop {{ pin_compatible('numpy', max_pin='x') }} (#16301) @jakirkham
Host implementation of to_arrow using nanoarrow (#16297) @zeroshade
Add ability to prefetch in cudf.pandas and change default to managed pool (#16296) @galipremsagar
Fix tests for polars 1.2 (#16292) @lithomas1
Introduce dedicated options for low memory readers (#16289) @galipremsagar
Remove decimal/floating 64/128bit switches due to register pressure (#16287) @pmattione-nvidia
Make ColumnAccessor strictly require a mapping of columns (#16285) @mroeschke
Introduce version file so we can conditionally handle things in tests (#16280) @wence-
Type & reduce cupy usage (#16277) @mroeschke
Update cudf::detail::grid_1d to use thread_index_type (#16276) @davidwendt
Replace np.isscalar/issubdtype checks with is_scalar/.kind checks (#16275) @mroeschke
Remove xml from sort_ninja_log.py utility (#16274) @davidwendt
Fix issue in horizontal concat implementation in cudf-polars (#16271) @wence-
Preserve order in left join for cudf-polars (#16268) @wence-
Replace is_datetime/timedelta_dtype checks with .kind checks (#16262) @mroeschke
Replace is_float/integer_dtype checks with .kind checks (#16261) @mroeschke
Build and test with CUDA 12.5.1 (#16259) @KyleFromNVIDIA
Replace is_bool_type with checking .dtype.kind (#16255) @mroeschke
remove cuco_noexcept.diff (#16254) @trxcllnt
Update contains_tests.cpp to use public cudf::slice (#16253) @davidwendt
Improve the test data for pylibcudf I/O tests (#16247) @lithomas1
Short circuit some Column methods (#16246) @mroeschke
Make nvcomp adapter compatible with new version macros (#16245) @vuule
Add Column.strftime/strptime instead of overloading as_string/datetime/timedelta_column (#16243) @mroeschke
Remove temporary functor overloads required by cuco version bump (#16242) @PointKernel
Remove hash_character_ngrams dependency from jaccard_index (#16241) @davidwendt
Expose sorted groupby parameters to pylibcudf (#16240) @wence-
Expose reflection to check if casting between two types is supported (#16239) @wence-
Handle nans in groupby-aggregations in polars executor (#16233) @wence-
Remove mr param from write_csv and write_json (#16231) @JayjeetAtGithub
Support Literals in groupby-agg (#16218) @wence-
Handle csv reader options in cudf-polars (#16211) @wence-
Update vendored thread_pool implementation (#16210) @wence-
Add low memory JSON reader for cudf.pandas (#16204) @galipremsagar
Clean up state variables in MultiIndex (#16203) @mroeschke
skip CMake 3.30.0 (#16202) @jameslamb
Assert valid metadata is passed in to_arrow for list_view (#16198) @wence-
Expose type traits to pylibcudf (#16197) @wence-
Report number of rows per file read by PQ reader when no row selection and fix segfault in chunked PQ reader when skip_rows > 0 (#16195) @mhaseeb123
Cast count aggs to correct dtype in translation (#16192) @wence-
Some small fixes in cudf-polars (#16191) @wence-
split up CUDA-suffixed dependencies in dependencies.yaml (#16183) @jameslamb
Define PTDS for the stream hook libs (#16182) @trxcllnt
Make test_python_cudf_pandas generate requirements.txt (#16181) @trxcllnt
Add environment-agnostic ci/run_cudf_polars_pytest.sh (#16178) @trxcllnt
Implement translation for some unary functions and a single datetime extraction (#16173) @wence-
Remove size constraints on source files in batched JSON reading (#16162) @shrshi
CI: Build wheels for cudf-polars (#16156) @lithomas1
Update cudf-polars for v1 release of polars (#16149) @wence-
Use strings concatenate to support large strings in CSV writer (#16148) @davidwendt
Use verify-alpha-spec hook (#16144) @KyleFromNVIDIA
Adds write-coalescing code path optimization to FST (#16143) @elstehle
MAINT: Adapt to NumPy 2 promotion changes (#16141) @seberg
API: Check for integer overflows when creating scalar from python int (#16140) @seberg
Remove the (unused) implementation of host_parse_nested_json (#16135) @vuule
Deprecate Arrow support in I/O (#16132) @lithomas1
Disable dict support for split-page kernel in the parquet reader. (#16128) @nvdbaranec
Add throughput metrics for REDUCTION_BENCH/REDUCTION_NVBENCH benchmarks (#16126) @jihoonson
Add ensure_index to not unnecessarily shallow copy cudf.Index (#16117) @mroeschke
Make binary operators work between fixed-point and floating args (#16116) @pmattione-nvidia
Implement Ternary copy_if_else (#16114) @wence-
Implement handlers for series literal in cudf-polars (#16113) @wence-
Fix dtype errors in StringArrays (#16111) @galipremsagar
Ensure MultiIndex.to_frame deep copies columns (#16110) @mroeschke
Parallelize gpuInitStringDescriptors for fixed length byte array data (#16109) @mhaseeb123
Finish implementation of cudf-polars boolean function handlers (#16098) @wence-
Expose and then implement support for cross joins in cudf-polars (#16097) @wence-
Defer copying in Column.astype(copy=True) (#16095) @mroeschke
Fix segfault in conditional join (#16094) @bdice
Free temp memory no longer needed in multibyte_split processing (#16091) @davidwendt
Rename gather/scatter benchmarks to clarify coalesced behavior. (#16083) @bdice
Adapt to polars upstream changes and turn on CI testing (#16081) @wence-
Reduce/clean copy usage in Series, reshaping (#16080) @mroeschke
Account for FIXED_LEN_BYTE_ARRAY when calculating fragment sizes in Parquet writer (#16064) @etseidl
Reduce (shallow) copies in DataFrame ops (#16060) @mroeschke
Add multi-file support to dask_cudf.read_json (#16057) @rjzamora
Reduce deep copies in Index ops (#16054) @mroeschke
Implement chunked column wise concat in chunked parquet reader (#16052) @galipremsagar
Add exception when trying to create large strings with cudf::test::strings_column_wrapper (#16049) @davidwendt
Return FrozenList for Index.names (#16047) @galipremsagar
Add ast cast test (#16045) @pmattione-nvidia
Remove override_dtypes and include_index from Frame._copy_type_metadata (#16043) @mroeschke
Add ruff rules to avoid importing from typing (#16040) @mroeschke
Fix decimal -> float cast in ast code (#16038) @pmattione-nvidia
Add compile option to enable large strings support (#16037) @davidwendt
Reduce conditional_join nvbench configurations (#16036) @srinivasyadav18
Project automation update: skip if not in project (#16035) @jarmak-nv
Add stream parameter to cudf::io::text::multibyte_split (#16034) @davidwendt
Delete unused code from stringfunction evaluator (#16032) @wence-
Fix exclude regex in pre-commit clang-format hook (#16030) @wence-
Refactor rmm usage in cudf.pandas (#16021) @galipremsagar
Enable ruff TCH: typing imports under if TYPE_CHECKING (#16015) @mroeschke
Restrict the allowed pandas timezone objects in cudf (#16013) @mroeschke
orc multithreaded benchmark (#16009) @zpuller
Add tests of expression-based sort and sort-by (#16008) @wence-
Add tests of implemented StringFunctions (#16007) @wence-
Add test that diagonal concat with mismatching schemas raises (#16006) @wence-
Add coverage selecting len from a dataframe (number of rows) (#16005) @wence-
Add basic tests of dataframe scan (#16003) @wence-
Add coverage for both expression and dataframe filter (#16002) @wence-
Remove deprecated ExtContext node (#16001) @wence-
Fix typo bug in gather implementation (#16000) @wence-
Extend coverage of groupby and rolling window nodes (#15999) @wence-
Coverage of binops where one or both operands are a scalar (#15998) @wence-
Add full coverage for whole-frame Agg expressions (#15997) @wence-
Add tests covering magic methods of Expr objects (#15996) @wence-
Add full coverage of utility functions (#15995) @wence-
Test behaviour of containers (#15994) @wence-
Fix implementation of any, all, and is_between (#15993) @wence-
Raise early on unhandled PythonScan node (#15992) @wence-
Remove mapfunction nodes that don't exist/aren't supported (#15991) @wence-
Add test coverage for slicing with "out of bounds" negative indices (#15990) @wence-
Standardize and type Series.dt methods (#15987) @mroeschke
Refactor distinct with hashset-based algorithms (#15984) @srinivasyadav18
resolve dependency-file-generator warning, remove unnecessary rapids-build-backend configuration (#15980) @jameslamb
Project automation bug fixes (#15971) @jarmak-nv
Add typing to single_column_frame (#15965) @mroeschke
Move some misc Frame methods to appropriate locations (#15963) @mroeschke
Condense pylibcudf data fixtures (#15958) @lithomas1
Refactor fillna logic to push specifics toward Frame subclasses and Column subclasses (#15957) @mroeschke
Remove unused parsing utilities (#15955) @vuule
Remove Scalar container type from polars interpreter (#15953) @wence-
Support arbitrary CUDA versions in UDF code (#15950) @bdice
Support large strings in cudf::io::text::multibyte_split (#15947) @davidwendt
Add external issue label and project automation (#15945) @jarmak-nv
Enable round-tripping of large strings in cudf (#15944) @galipremsagar
Add more complete type annotations in polars interpreter (#15942) @wence-
Update implementations to build with the latest cuco (#15938) @PointKernel
Support timezone aware pandas inputs in cudf (#15935) @mroeschke
Define Column.nan_as_null to return self (#15923) @mroeschke
Make Frame._dtype an iterator instead of a dict (#15920) @mroeschke
Port start of datetime.hpp to pylibcudf (#15916) @wence-
Introduce NamedColumn concept in cudf-polars (#15914) @wence-
Avoid redefining Frame._get_columns_by_label in subclasses (#15912) @mroeschke
Templatization of fixed-width parquet decoding kernels. (#15911) @nvdbaranec
New Decimal <--> Floating conversion (#15905) @pmattione-nvidia
Use Arrow C Data Interface functions for Python interop (#15904) @vyasr
Use offsetalator in cudf::io::json::detail::parse_string (#15900) @davidwendt
Rename strings multiple target replace API (#15898) @davidwendt
Apply clang-tidy autofixes (#15894) @vyasr
Update Python labels and remove unnecessary ones (#15893) @vyasr
Clean up pylibcudf test assertions (#15892) @lithomas1
Use offsetalator in orc rowgroup_char_counts_kernel (#15891) @davidwendt
Ensure literals have correct dtype (#15890) @wence-
Add overflow check when converting large strings to lists columns (#15887) @davidwendt
Use offsetalator in nvtext::tokenize_with_vocabulary (#15878) @davidwendt
Update interleave lists column for large strings (#15877) @davidwendt
Simple NumPy 2 fixes that are clearly no behavior change (#15876) @seberg
Support arrow:schema in Parquet writer to faithfully roundtrip duration types with Arrow (#15875) @mhaseeb123
Refactor join benchmarks to target public APIs with the default stream (#15873) @PointKernel
Fix url-decode benchmark to use offsetalator (#15871) @davidwendt
Use offsetalator in strings shift functor (#15870) @davidwendt
Memory Profiling (#15866) @madsbk
Expose stream parameter to public rolling APIs (#15865) @srinivasyadav18
Make Frame.astype return Self instead of a ColumnAccessor (#15861) @mroeschke
Use ColumnAccessor row and column length attributes more consistently (#15857) @mroeschke
add unit test setup for cudf_kafka (#15853) @jameslamb
Remove internal usage of core.index.as_index in favor of cudf.Index (#15851) @mroeschke
Ensure cudf.Series(cudf.Series(...)) creates a reference to the same index (#15845) @mroeschke
Remove benchmark-specific use of pinned-pooled memory in Parquet multithreaded benchmark. (#15838) @nvdbaranec
Implement on_bad_lines in json reader (#15834) @galipremsagar
Make Column.to_pandas return Index instead of Series (#15833) @mroeschke
Add test of interoperability of cuDF and arrow BYTESTREAMSPLIT encoders (#15832) @etseidl
Refactor Parquet writer options and builders (#15831) @etseidl
Migrate reshape.pxd to pylibcudf (#15827) @lithomas1
Remove legacy JSON reader and concurrent_unordered_map.cuh. (#15813) @bdice
Switch cuIO benchmarks to use pinned-pool host allocations by default. (#15805) @nvdbaranec
Change thrust::count_if call to raw kernel in strings split APIs (#15762) @davidwendt
Improve performance for long strings for nvtext::replace_tokens (#15756) @davidwendt
Implement chunked parquet reader in cudf-python (#15728) @galipremsagar
Add from_arrow_host functions for cudf interop with nanoarrow (#15645) @zeroshade
Add ability to enable rmm pool on cudf.pandas import (#15628) @galipremsagar
Executor for polars logical plans (#15504) @wence-
Implement dayname and monthname to match pandas (#15479) @btepera
Utilities for decimal <--> floating conversion (#15359) @pmattione-nvidia
For powers of 10, replace ipow with switch (#15353) @pmattione-nvidia
Use rapids-build-backend. (#15245) @vyasr
Add codecov coverage for pandas_tests (#14513) @galipremsagar

- C++
Published by raydouglass almost 2 years ago

https://github.com/rapidsai/cudf - v24.06.01

🚨 Breaking Changes

Deprecate Groupby.collect (#15808) @galipremsagar
Raise FileNotFoundError when a literal JSON string that looks like a json filename is passed (#15806) @lithomas1
Support filtered I/O in chunked_parquet_reader and simplify the use of parquet_reader_options (#15764) @mhaseeb123
Raise errors for unsupported operations on certain types (#15712) @galipremsagar
Support DurationType in cudf parquet reader via arrow:schema (#15617) @mhaseeb123
Remove protobuf and use parsed ORC statistics from libcudf (#15564) @bdice
Remove legacy JSON reader from Python (#15538) @bdice
Removing all batching code from parquet writer (#15528) @mhaseeb123
Convert libcudf resource parameters to rmm::device_async_resource_ref (#15507) @harrism
Remove deprecated strings offsets_begin (#15454) @davidwendt
Floating <--> fixed-point conversion must now be called explicitly (#15438) @pmattione-nvidia
Bind read_parquet_metadata API to libcudf instead of pyarrow and extract RowGroup information (#15398) @mhaseeb123
Remove deprecated hash() and spark_murmurhash3_x86_32() (#15375) @davidwendt
Remove empty elements from exploded character-ngrams output (#15371) @davidwendt
[FEA] Performance improvement for mixed left semi/anti join (#15288) @tgujar
Align date_range defaults with pandas, support tz (#15139) @mroeschke

🐛 Bug Fixes

Backport: Use size_t to allow large conditional joins (#16127) (#16133) @bdice
Backport #16045 to 24.06 (#16102) @vyasr
Backport #16038 to 24.06 (#16101) @vyasr
Backport: Fix segfault in conditional join (#16094) (#16100) @bdice
Add patch for incorrect cuco noexcept clauses (#16077) @vyasr
Revert "Fix docs for IO readers and strings_convert" (#15872) @vyasr
Remove problematic call of index setter to unblock dask-cuda CI (#15844) @charlesbluca
Use rapids_cpm_nvtx3 to get same nvtx3 target state as rmm (#15840) @robertmaynard
Return boolean from config_host_memory_resource instead of throwing (#15815) @abellina
Add temporary dask-cudf workaround for categorical sorting (#15801) @rjzamora
Fix row group alignment in ORC writer (#15789) @vuule
Raise error when sorting by categorical column in dask-cudf (#15788) @rjzamora
Upgrade arrow to 16.1 (#15787) @galipremsagar
Add support for PandasArray for pandas<2.1.0 (#15786) @galipremsagar
Limit runtime dependency to libarrow>=16.0.0,<16.1.0a0 (#15782) @pentschev
Fix cat.as_ordered not propagating correct size (#15780) @mroeschke
Handle mixed-like homogeneous types in isin (#15771) @galipremsagar
Fix id_vars and value_vars not accepting string scalars in melt (#15765) @mroeschke
Fix DatetimeIndex.loc for all types of ordering cases (#15761) @galipremsagar
Fix arrow versioning logic (#15755) @vyasr
Avoid running sanitizer on Java test designed to cause an error (#15753) @jlowe
Handle empty dataframe object with index present in setitem of loc (#15752) @galipremsagar
Eliminate circular reference in DataFrame/Series.iloc/loc (#15749) @mroeschke
Cap the absolute row index per pass in parquet chunked reader. (#15735) @nvdbaranec
Fix Index.repeat for datetime64 types (#15722) @galipremsagar
Fix multibyte check for case convert for large strings (#15721) @davidwendt
Fix get_loc to properly fetch results from an index that is in decreasing order (#15719) @galipremsagar
Return same type as the original index for .loc operations (#15717) @galipremsagar
Correct static builds + static arrow (#15715) @robertmaynard
Raise errors for unsupported operations on certain types (#15712) @galipremsagar
Fix ColumnAccessor caching of nrows if empty previously (#15710) @mroeschke
Allow None when nan_as_null=False in column constructor (#15709) @galipremsagar
Refine CudaTest.testCudaException in case throwing wrong type of CudaError under aarch64 (#15706) @sperlingxx
Fix maxima of categorical column (#15701) @rjzamora
Add proxy for inplace operations in cudf.pandas (#15695) @galipremsagar
Make nan_as_null behavior consistent across all APIs (#15692) @galipremsagar
Fix CI s3 api command to fetch latest results (#15687) @galipremsagar
Add NumpyExtensionArray proxy type in cudf.pandas (#15686) @galipremsagar
Properly implement binaryops for proxy types (#15684) @galipremsagar
Fix copy assignment and the comparison operator of rmm_host_allocator (#15677) @vuule
Fix multi-source reading in JSON byte range reader (#15671) @shrshi
Return int64 when pandas compatible mode is turned on for get_indexer (#15659) @galipremsagar
Fix Index contains for error validations and float vs int comparisons (#15657) @galipremsagar
Preserve sub-second data for time scalars in column construction (#15655) @galipremsagar
Check row limit size in cudf::strings::join_strings (#15643) @davidwendt
Enable sorting on column with nulls using query-planning (#15639) @rjzamora
Fix operator precedence problem in Parquet reader (#15638) @etseidl
Fix decoding of dictionary encoded FIXED_LEN_BYTE_ARRAY data in Parquet reader (#15601) @etseidl
Fix debug warnings/errors in from_arrow_device_test.cpp (#15596) @davidwendt
Add "collect" aggregation support to dask-cudf (#15593) @rjzamora
Fix categorical-accessor support and testing in dask-cudf (#15591) @rjzamora
Disable compute-sanitizer usage in CI tests with CUDA<11.6 (#15584) @davidwendt
Preserve RangeIndex.step in to_arrow/from_arrow (#15581) @mroeschke
Ignore new cupy warning (#15574) @vyasr
Add cuda-sanitizer-api dependency for test-cpp matrix 11.4 (#15573) @davidwendt
Allow apply udf to reference global modules in cudf.pandas (#15569) @mroeschke
Fix deprecation warnings for json legacy reader (#15563) @davidwendt
Fix millisecond resampling in cudf Python (#15560) @mroeschke
Rename JSON_READER_OPTION to JSON_READER_OPTION_NVBENCH. (#15553) @bdice
Fix a JNI bug in JSON parsing fixup (#15550) @revans2
Remove conda channel setup from wheel CI image script. (#15539) @bdice
cudf.pandas: Series dt accessor is CombinedDatetimelikeProperties (#15523) @wence-
Fix for some compiler warnings in parquet/page_decode.cuh (#15518) @etseidl
Fix exponent overflow in strings-to-double conversion (#15517) @davidwendt
nanoarrow uses package override for proper pinned versions generation (#15515) @robertmaynard
Remove index name overrides in dask-cudf pyarrow table dispatch (#15514) @charlesbluca
Fix async synchronization issues in json_column.cu (#15497) @karthikeyann
Add new patch to hide more CCCL APIs (#15493) @vyasr
Make improvements in pandas-test reporting (#15485) @galipremsagar
Fixed page data truncation in parquet writer under certain conditions. (#15474) @nvdbaranec
Only use data_type constructor with scale for decimal types (#15472) @wence-
Avoid "p2p" shuffle as a default when dask_cudf is imported (#15469) @rjzamora
Fix debug build errors from to_arrow_device_test.cpp (#15463) @davidwendt
Fix base_normalator::integer_sizeof_fn integer dispatch (#15457) @davidwendt
Allow consumers of static builds to find nanoarrow (#15456) @robertmaynard
Allow jit compilation when using a splayed CUDA toolkit (#15451) @robertmaynard
Handle case of scan aggregation in groupby-transform (#15450) @wence-
Test static builds in CI and fix nanoarrow configure (#15437) @vyasr
Fixes potential race in JSON parser when parsing JSON lines format and when recovering from invalid lines (#15419) @elstehle
Fix errors in chunked ORC writer when no tables were (successfully) written (#15393) @vuule
Support implicit array conversion with query-planning enabled (#15378) @rjzamora
Fix arrow-based round trip of empty dataframes (#15373) @wence-
Remove empty elements from exploded character-ngrams output (#15371) @davidwendt
Remove boundscheck=False setting in cython files (#15362) @wence-
Patch dask-expr var logic in dask-cudf (#15347) @rjzamora
Fix for logical and syntactical errors in libcudf c++ examples (#15346) @mhaseeb123
Disable dask-expr in docs builds. (#15343) @bdice
Apply the cuFile error work around to data_sink as well (#15335) @vuule
Fix parquet predicate filtering with column projection (#15113) @karthikeyann
Check column type equality, handling nested types correctly. (#14531) @bdice

📖 Documentation

Fix docs for IO readers and strings_convert (#15842) @bdice
Update cudf.pandas docs for GA (#15744) @beckernick
Add contributing warning about circular imports (#15691) @er-eis
Update libcudf developer guide for strings offsets column (#15661) @davidwendt
Update developer guide with device_async_resource_ref guidelines (#15562) @harrism
DOC: add pandas intersphinx mapping (#15531) @raybellwaves
rm-dup-doc in frame.py (#15530) @raybellwaves
Update CONTRIBUTING.md to use latest cuda env (#15467) @raybellwaves
Doc: interleave columns pandas compat (#15383) @raybellwaves
Simplified README Examples (#15338) @wkaisertexas
Add debug tips section to libcudf developer guide (#15329) @davidwendt
Fix and clarify notes on result ordering (#13255) @shwina

🚀 New Features

Add JNI bindings for zstd compression of NVCOMP. (#15729) @firestarman
Fix spaces around CSV quoted strings (#15727) @thabetx
Add default pinned pool that falls back to new pinned allocations (#15665) @vuule
Overhaul ops-codeowners coverage (#15660) @raydouglass
Concatenate dictionary of objects along axis=1 (#15623) @er-eis
Construct pylibcudf columns from objects supporting __cuda_array_interface__ (#15615) @brandon-b-miller
Expose some Parquet per-column configuration options via the python API (#15613) @etseidl
Migrate string find operations to pylibcudf (#15604) @brandon-b-miller
Round trip FIXED_LEN_BYTE_ARRAY data properly in Parquet writer (#15600) @etseidl
Reading multi-line JSON in string columns using runtime configurable delimiter (#15556) @shrshi
Remove public gtest dependency from libcudf conda package (#15534) @robertmaynard
Fea/move to latest nanoarrow (#15526) @robertmaynard
Migrate string case operations to pylibcudf (#15489) @brandon-b-miller
Add Parquet encoding statistics to column chunk metadata (#15452) @etseidl
Implement JNI for chunked ORC reader (#15446) @ttnghia
Add some missing optional fields to the Parquet RowGroup metadata (#15421) @etseidl
Adding parquet transcoding example (#15420) @mhaseeb123
Add fields to Parquet Statistics structure that were added in parquet-format 2.10 (#15412) @etseidl
Add option to Parquet writer to skip compressing individual columns (#15411) @etseidl
Add BYTE_STREAM_SPLIT support to Parquet (#15311) @etseidl
Introduce benchmark suite for JSON reader options (#15124) @shrshi
Implement ORC chunked reader (#15094) @ttnghia
Extend cudf devcontainers to specify jitify2 kernel cache (#15068) @robertmaynard
Add to_arrow_device function to cudf interop using nanoarrow (#15047) @zeroshade
Add JSON option to prune columns (#14996) @karthikeyann

🛠️ Improvements

Deprecate Groupby.collect (#15808) @galipremsagar
Raise FileNotFoundError when a literal JSON string that looks like a json filename is passed (#15806) @lithomas1
Deprecate divisions='quantile' support in set_index (#15804) @rjzamora
Improve performance of Series.to_numpy/to_cupy (#15792) @mroeschke
Access self.index instead of self._index where possible (#15781) @mroeschke
Support filtered I/O in chunked_parquet_reader and simplify the use of parquet_reader_options (#15764) @mhaseeb123
Avoid index-to-column conversion in some DataFrame ops (#15763) @mroeschke
Fix chunked_parquet_reader behavior when input has no more rows to read (#15757) @mhaseeb123
[JNI] Expose java API for cudf::io::config_host_memory_resource (#15745) @abellina
Migrate all cpp pxd files into pylibcudf (#15740) @vyasr
Validate and materialize iterators earlier in as_column (#15739) @mroeschke
Push some as_column arrow logic to ColumnBase.from_arrow (#15738) @mroeschke
Expose stream parameter in public reduction APIs (#15737) @srinivasyadav18
remove unnecessary 'setuptools' host dependency, simplify dependencies.yaml (#15736) @jameslamb
Defer to C++ equality and hashing for pylibcudf DataType and Aggregation objects (#15732) @wence-
Implement null-aware NOT_EQUALS binop (#15731) @wence-
Fix split-record result list column offset type (#15707) @davidwendt
Upgrade arrow to 16 (#15703) @galipremsagar
Remove experimental namespace from make_strings_children (#15702) @davidwendt
Rework get_json_object benchmark to use nvbench (#15698) @davidwendt
Rework some python tests of Parquet delta encodings (#15693) @etseidl
Skeleton cudf polars package (#15688) @wence-
Upgrade pre commit hooks (#15685) @wence-
Allow fillna to validate for CategoricalColumn.fillna (#15683) @galipremsagar
Misc Column cleanups (#15682) @mroeschke
Reducing runtime of JSON reader options benchmark (#15681) @shrshi
Add Timestamp and Timedelta proxy types (#15680) @galipremsagar
Remove host_parse_nested_json. (#15674) @bdice
Reduce runtime for ParquetChunkedReaderInputLimitTest gtests (#15672) @davidwendt
Add large-strings gtest for cudf::interleave_columns (#15669) @davidwendt
Use experimental make_strings_children for multi-replace_re (#15667) @davidwendt
Enabled Holiday types in cudf.pandas (#15664) @galipremsagar
Remove obsolete XFAIL markers for query-planning (#15662) @rjzamora
Clean up join benchmarks (#15644) @PointKernel
Enable warnings as errors in custreamz (#15642) @mroeschke
Improve distinct join with set retrieve (#15636) @PointKernel
Fix -Werror=type-limits. (#15635) @bdice
Enable FutureWarnings/DeprecationWarnings as errors for dask_cudf (#15634) @mroeschke
Remove NVBench SHA override. (#15633) @alliepiper
Add support for large string columns to Parquet reader and writer (#15632) @etseidl
Large strings support in MD5 and SHA hashers (#15631) @davidwendt
Fix make_offsets_child_column usage in cudf::strings::detail::shift (#15630) @davidwendt
Use experimental make_strings_children for strings convert (#15629) @davidwendt
Forward-merge branch-24.04 to branch-24.06 (#15627) @bdice
Avoid accessing attributes via _column if not needed (#15624) @mroeschke
Make ColumnBase.__cuda_array_interface__ opt out instead of opt in (#15622) @mroeschke
Large strings support for cudf::gather (#15621) @davidwendt
Remove jni-docker-build workflow (#15619) @bdice
Support DurationType in cudf parquet reader via arrow:schema (#15617) @mhaseeb123
Drop Centos7 support (#15608) @NvTimLiu
Use experimental make_strings_children for json/csv writers (#15599) @davidwendt
Use experimental make_strings_children for strings join/url_encode/slice (#15598) @davidwendt
Use experimental make_strings_children in nvtext APIs (#15595) @davidwendt
Migrate to {{ stdlib("c") }} (#15594) @hcho3
Deprecate to/from_dask_dataframe APIs in dask-cudf (#15592) @rjzamora
Minor fixups for future NumPy 2 compatibility (#15590) @seberg
Delay materializing RangeIndex in .reset_index (#15588) @mroeschke
Use experimental make_strings_children for capitalize/case/pad functions (#15587) @davidwendt
Use experimental make_strings_children for strings replace/filter/translate (#15586) @davidwendt
Add multithreaded parquet reader benchmarks. (#15585) @nvdbaranec
Don't materialize column during RangeIndex methods (#15582) @mroeschke
Improve performance for cudf::strings::count_re (#15578) @davidwendt
Replace RangeIndex._start/_stop/_step with _range (#15576) @mroeschke
add --rm and --name to devcontainer run args (#15572) @trxcllnt
Change the default dictionary policy in Parquet writer from ALWAYS to ADAPTIVE (#15570) @mhaseeb123
Rename experimental JSON tests. (#15568) @bdice
Refactor JNI native dependency loading to allow returning of library path (#15566) @jlowe
Remove protobuf and use parsed ORC statistics from libcudf (#15564) @bdice
Deprecate legacy JSON reader options. (#15558) @bdice
Use same .clang-format in cuDF JNI (#15557) @bdice
Large strings support for cudf::fill (#15555) @davidwendt
Upgrade upper bound pinning to pandas-2.2.2 (#15554) @galipremsagar
Work around issues with cccl main (#15552) @miscco
Enable pandas plotting unit tests for cudf.pandas (#15547) @mroeschke
Move timezone conversion logic to DatetimeColumn (#15545) @mroeschke
Large strings support for cudf::interleave_columns (#15544) @davidwendt
[skip ci] Switch back to 24.06 branch for pandas tests (#15543) @galipremsagar
Remove checks dependency from static-configure test job. (#15542) @bdice
Remove legacy JSON reader from Python (#15538) @bdice
Enable more ignored pandas unit tests for cudf.pandas (#15535) @mroeschke
Large strings support for cudf::clamp (#15533) @davidwendt
Remove version hard-coding (#15529) @galipremsagar
Removing all batching code from parquet writer (#15528) @mhaseeb123
Make some private class properties not settable (#15527) @mroeschke
Large strings support in regex replace APIs (#15524) @davidwendt
Skip pandas unit tests that crash pytest workers in cudf.pandas (#15521) @mroeschke
Preserve column metadata during more DataFrame operations (#15519) @mroeschke
Move to pandas-tests to a dedicated workflow file and trigger it from branch.yaml (#15516) @galipremsagar
Large strings gtest fixture and utilities (#15513) @davidwendt
Convert libcudf resource parameters to rmm::device_async_resource_ref (#15507) @harrism
Relax protobuf lower bound to 3.20. (#15506) @bdice
Clean up index methods (#15496) @mroeschke
Update strings contains benchmarks to nvbench (#15495) @davidwendt
Update NVBench fixture to use new hooks, fix pinned memory segfault. (#15492) @alliepiper
Enable tests/scalar and test/series in cudf.pandas tests (#15486) @mroeschke
Clean up __cuda_array_interface__ handling in as_column (#15477) @mroeschke
Avoid .ordered and .categories from being settable in CategoricalColumn and CategoricalDtype (#15475) @mroeschke
Ignore pandas tests for cudf.pandas that need motoserver (#15468) @mroeschke
Use cached_property for NumericColumn.nan_count instead of ._nan_count variable (#15466) @mroeschke
Add to_arrow_device() functions that accept views (#15465) @davidwendt
Add custom status check workflow (#15464) @galipremsagar
Disable pandas 2.x clipboard tests in cudf.pandas tests (#15462) @mroeschke
Enable tests/strings/test_api.py and tests/io/pytables in cudf.pandas tests (#15461) @mroeschke
Enable test_parsing in cudf.pandas tests (#15460) @mroeschke
Add from_arrow_device function to cudf interop using nanoarrow (#15458) @zeroshade
Remove deprecated strings offsets_begin (#15454) @davidwendt
Enable tests/windows/ in cudf.pandas tests (#15444) @mroeschke
Enable tests/interchange/test_impl.py in cudf.pandas tests (#15443) @mroeschke
Enable tests/io/test_user_agent.py in cudf pandas tests (#15442) @mroeschke
Performance improvement in libcudf case conversion for long strings (#15441) @davidwendt
Remove prior test skipping in run-pandas-tests with testing 2.2.1 (#15440) @mroeschke
Support orc and text IO with dask-expr using legacy conversion (#15439) @rjzamora
Floating <--> fixed-point conversion must now be called explicitly (#15438) @pmattione-nvidia
Unify Copy-On-Write and Spilling (#15436) @madsbk
Enable dask_cudf json and s3 tests with query-planning on (#15408) @rjzamora
Bump ruff and codespell pre-commit checks (#15407) @mroeschke
Enable all tests for arm arch (#15402) @galipremsagar
Bind read_parquet_metadata API to libcudf instead of pyarrow and extract RowGroup information (#15398) @mhaseeb123
Optimizing multi-source byte range reading in JSON reader (#15396) @shrshi
add correct labels to pandas_function_request.md (#15381) @raybellwaves
Remove deprecated hash() and spark_murmurhash3_x86_32() (#15375) @davidwendt
Large strings support in cudf::merge (#15374) @davidwendt
Enable test-reporting for pandas pytests in CI (#15369) @galipremsagar
Use logical types in Parquet reader (#15365) @etseidl
Add experimental make_strings_children utility (#15363) @davidwendt
Forward-merge branch-24.04 to branch-24.06 (#15349) @bdice
Fix CMake files in libcudf C++ examples to use existing libcudf build if present (#15348) @mhaseeb123
Use ruff pydocstyle over pydocstyle pre-commit hook (#15345) @mroeschke
Refactor stream mode setup for gtests (#15337) @davidwendt
Benchmark decimal <--> floating conversions. (#15334) @pmattione-nvidia
Avoid duplicate dask-cudf testing (#15333) @rjzamora
Skip decode steps in Parquet reader when nullable columns have no nulls (#15332) @etseidl
Update udf_cpp to use rapids_cpm_cccl. (#15331) @bdice
Forward-merge branch-24.04 into branch-24.06 skip ci @rapids-bot[bot]
Allow numeric_only=True for simple groupby reductions (#15326) @rjzamora
Drop CentOS 7 support. (#15323) @bdice
Rework cudf::find_and_replace_all to use gather-based make_strings_column (#15305) @davidwendt
First pass at adding testing for pylibcudf (#15300) @vyasr
[FEA] Performance improvement for mixed left semi/anti join (#15288) @tgujar
Rework cudf::replace_nulls to use strings::detail::copy_if_else (#15286) @davidwendt
Clean up special casing in as_column for non-typed input (#15276) @mroeschke
Large strings support in cudf::concatenate (#15195) @davidwendt
Use less is_categorical_dtype (#15148) @mroeschke
Align date_range defaults with pandas, support tz (#15139) @mroeschke
ModuleAccelerator performance: cache the result of checking if a caller is in the denylist (#15056) @shwina
Use offsetalator in cudf::strings::replace functions (#14824) @davidwendt
Cleanup some timedelta/datetime column logic (#14715) @mroeschke
Refactor numpy array input in as_column (#14651) @mroeschke
Refactor joins for conditional semis and antis (#14646) @DanialJavady96
Eagerly populate the class dict for cudf.pandas proxy types (#14534) @shwina
Some additional kernel thread index refactoring. (#14107) @bdice

- C++
Published by raydouglass almost 2 years ago

https://github.com/rapidsai/cudf - v24.06.00

🚨 Breaking Changes

Deprecate Groupby.collect (#15808) @galipremsagar
Raise FileNotFoundError when a literal JSON string that looks like a json filename is passed (#15806) @lithomas1
Support filtered I/O in chunked_parquet_reader and simplify the use of parquet_reader_options (#15764) @mhaseeb123
Raise errors for unsupported operations on certain types (#15712) @galipremsagar
Support DurationType in cudf parquet reader via arrow:schema (#15617) @mhaseeb123
Remove protobuf and use parsed ORC statistics from libcudf (#15564) @bdice
Remove legacy JSON reader from Python (#15538) @bdice
Removing all batching code from parquet writer (#15528) @mhaseeb123
Convert libcudf resource parameters to rmm::device_async_resource_ref (#15507) @harrism
Remove deprecated strings offsets_begin (#15454) @davidwendt
Floating <--> fixed-point conversion must now be called explicitly (#15438) @pmattione-nvidia
Bind read_parquet_metadata API to libcudf instead of pyarrow and extract RowGroup information (#15398) @mhaseeb123
Remove deprecated hash() and spark_murmurhash3_x86_32() (#15375) @davidwendt
Remove empty elements from exploded character-ngrams output (#15371) @davidwendt
[FEA] Performance improvement for mixed left semi/anti join (#15288) @tgujar
Align date_range defaults with pandas, support tz (#15139) @mroeschke

🐛 Bug Fixes

Revert "Fix docs for IO readers and strings_convert" (#15872) @vyasr
Remove problematic call of index setter to unblock dask-cuda CI (#15844) @charlesbluca
Use rapids_cpm_nvtx3 to get same nvtx3 target state as rmm (#15840) @robertmaynard
Return boolean from config_host_memory_resource instead of throwing (#15815) @abellina
Add temporary dask-cudf workaround for categorical sorting (#15801) @rjzamora
Fix row group alignment in ORC writer (#15789) @vuule
Raise error when sorting by categorical column in dask-cudf (#15788) @rjzamora
Upgrade arrow to 16.1 (#15787) @galipremsagar
Add support for PandasArray for pandas<2.1.0 (#15786) @galipremsagar
Limit runtime dependency to libarrow>=16.0.0,<16.1.0a0 (#15782) @pentschev
Fix cat.as_ordered not propagating correct size (#15780) @mroeschke
Handle mixed-like homogeneous types in isin (#15771) @galipremsagar
Fix id_vars and value_vars not accepting string scalars in melt (#15765) @mroeschke
Fix DatetimeIndex.loc for all types of ordering cases (#15761) @galipremsagar
Fix arrow versioning logic (#15755) @vyasr
Avoid running sanitizer on Java test designed to cause an error (#15753) @jlowe
Handle empty dataframe object with index present in setitem of loc (#15752) @galipremsagar
Eliminate circular reference in DataFrame/Series.iloc/loc (#15749) @mroeschke
Cap the absolute row index per pass in parquet chunked reader. (#15735) @nvdbaranec
Fix Index.repeat for datetime64 types (#15722) @galipremsagar
Fix multibyte check for case convert for large strings (#15721) @davidwendt
Fix get_loc to properly fetch results from an index that is in decreasing order (#15719) @galipremsagar
Return same type as the original index for .loc operations (#15717) @galipremsagar
Correct static builds + static arrow (#15715) @robertmaynard
Raise errors for unsupported operations on certain types (#15712) @galipremsagar
Fix ColumnAccessor caching of nrows if empty previously (#15710) @mroeschke
Allow None when nan_as_null=False in column constructor (#15709) @galipremsagar
Refine CudaTest.testCudaException in case throwing wrong type of CudaError under aarch64 (#15706) @sperlingxx
Fix maxima of categorical column (#15701) @rjzamora
Add proxy for inplace operations in cudf.pandas (#15695) @galipremsagar
Make nan_as_null behavior consistent across all APIs (#15692) @galipremsagar
Fix CI s3 api command to fetch latest results (#15687) @galipremsagar
Add NumpyExtensionArray proxy type in cudf.pandas (#15686) @galipremsagar
Properly implement binaryops for proxy types (#15684) @galipremsagar
Fix copy assignment and the comparison operator of rmm_host_allocator (#15677) @vuule
Fix multi-source reading in JSON byte range reader (#15671) @shrshi
Return int64 when pandas compatible mode is turned on for get_indexer (#15659) @galipremsagar
Fix Index contains for error validations and float vs int comparisons (#15657) @galipremsagar
Preserve sub-second data for time scalars in column construction (#15655) @galipremsagar
Check row limit size in cudf::strings::join_strings (#15643) @davidwendt
Enable sorting on column with nulls using query-planning (#15639) @rjzamora
Fix operator precedence problem in Parquet reader (#15638) @etseidl
Fix decoding of dictionary encoded FIXED_LEN_BYTE_ARRAY data in Parquet reader (#15601) @etseidl
Fix debug warnings/errors in from_arrow_device_test.cpp (#15596) @davidwendt
Add "collect" aggregation support to dask-cudf (#15593) @rjzamora
Fix categorical-accessor support and testing in dask-cudf (#15591) @rjzamora
Disable compute-sanitizer usage in CI tests with CUDA<11.6 (#15584) @davidwendt
Preserve RangeIndex.step in to_arrow/from_arrow (#15581) @mroeschke
Ignore new cupy warning (#15574) @vyasr
Add cuda-sanitizer-api dependency for test-cpp matrix 11.4 (#15573) @davidwendt
Allow apply udf to reference global modules in cudf.pandas (#15569) @mroeschke
Fix deprecation warnings for json legacy reader (#15563) @davidwendt
Fix millisecond resampling in cudf Python (#15560) @mroeschke
Rename JSON_READER_OPTION to JSON_READER_OPTION_NVBENCH. (#15553) @bdice
Fix a JNI bug in JSON parsing fixup (#15550) @revans2
Remove conda channel setup from wheel CI image script. (#15539) @bdice
cudf.pandas: Series dt accessor is CombinedDatetimelikeProperties (#15523) @wence-
Fix for some compiler warnings in parquet/page_decode.cuh (#15518) @etseidl
Fix exponent overflow in strings-to-double conversion (#15517) @davidwendt
nanoarrow uses package override for proper pinned versions generation (#15515) @robertmaynard
Remove index name overrides in dask-cudf pyarrow table dispatch (#15514) @charlesbluca
Fix async synchronization issues in json_column.cu (#15497) @karthikeyann
Add new patch to hide more CCCL APIs (#15493) @vyasr
Make improvements in pandas-test reporting (#15485) @galipremsagar
Fixed page data truncation in parquet writer under certain conditions. (#15474) @nvdbaranec
Only use data_type constructor with scale for decimal types (#15472) @wence-
Avoid "p2p" shuffle as a default when dask_cudf is imported (#15469) @rjzamora
Fix debug build errors from to_arrow_device_test.cpp (#15463) @davidwendt
Fix base_normalator::integer_sizeof_fn integer dispatch (#15457) @davidwendt
Allow consumers of static builds to find nanoarrow (#15456) @robertmaynard
Allow jit compilation when using a splayed CUDA toolkit (#15451) @robertmaynard
Handle case of scan aggregation in groupby-transform (#15450) @wence-
Test static builds in CI and fix nanoarrow configure (#15437) @vyasr
Fixes potential race in JSON parser when parsing JSON lines format and when recovering from invalid lines (#15419) @elstehle
Fix errors in chunked ORC writer when no tables were (successfully) written (#15393) @vuule
Support implicit array conversion with query-planning enabled (#15378) @rjzamora
Fix arrow-based round trip of empty dataframes (#15373) @wence-
Remove empty elements from exploded character-ngrams output (#15371) @davidwendt
Remove boundscheck=False setting in cython files (#15362) @wence-
Patch dask-expr var logic in dask-cudf (#15347) @rjzamora
Fix for logical and syntactical errors in libcudf c++ examples (#15346) @mhaseeb123
Disable dask-expr in docs builds. (#15343) @bdice
Apply the cuFile error work around to data_sink as well (#15335) @vuule
Fix parquet predicate filtering with column projection (#15113) @karthikeyann
Check column type equality, handling nested types correctly. (#14531) @bdice

📖 Documentation

Fix docs for IO readers and strings_convert (#15842) @bdice
Update cudf.pandas docs for GA (#15744) @beckernick
Add contributing warning about circular imports (#15691) @er-eis
Update libcudf developer guide for strings offsets column (#15661) @davidwendt
Update developer guide with device_async_resource_ref guidelines (#15562) @harrism
DOC: add pandas intersphinx mapping (#15531) @raybellwaves
rm-dup-doc in frame.py (#15530) @raybellwaves
Update CONTRIBUTING.md to use latest cuda env (#15467) @raybellwaves
Doc: interleave columns pandas compat (#15383) @raybellwaves
Simplified README Examples (#15338) @wkaisertexas
Add debug tips section to libcudf developer guide (#15329) @davidwendt
Fix and clarify notes on result ordering (#13255) @shwina

🚀 New Features

Add JNI bindings for zstd compression of NVCOMP. (#15729) @firestarman
Fix spaces around CSV quoted strings (#15727) @thabetx
Add default pinned pool that falls back to new pinned allocations (#15665) @vuule
Overhaul ops-codeowners coverage (#15660) @raydouglass
Concatenate dictionary of objects along axis=1 (#15623) @er-eis
Construct pylibcudf columns from objects supporting __cuda_array_interface__ (#15615) @brandon-b-miller
Expose some Parquet per-column configuration options via the python API (#15613) @etseidl
Migrate string find operations to pylibcudf (#15604) @brandon-b-miller
Round trip FIXED_LEN_BYTE_ARRAY data properly in Parquet writer (#15600) @etseidl
Reading multi-line JSON in string columns using runtime configurable delimiter (#15556) @shrshi
Remove public gtest dependency from libcudf conda package (#15534) @robertmaynard
Fea/move to latest nanoarrow (#15526) @robertmaynard
Migrate string case operations to pylibcudf (#15489) @brandon-b-miller
Add Parquet encoding statistics to column chunk metadata (#15452) @etseidl
Implement JNI for chunked ORC reader (#15446) @ttnghia
Add some missing optional fields to the Parquet RowGroup metadata (#15421) @etseidl
Adding parquet transcoding example (#15420) @mhaseeb123
Add fields to Parquet Statistics structure that were added in parquet-format 2.10 (#15412) @etseidl
Add option to Parquet writer to skip compressing individual columns (#15411) @etseidl
Add BYTE_STREAM_SPLIT support to Parquet (#15311) @etseidl
Introduce benchmark suite for JSON reader options (#15124) @shrshi
Implement ORC chunked reader (#15094) @ttnghia
Extend cudf devcontainers to specify jitify2 kernel cache (#15068) @robertmaynard
Add to_arrow_device function to cudf interop using nanoarrow (#15047) @zeroshade
Add JSON option to prune columns (#14996) @karthikeyann

🛠️ Improvements

Deprecate Groupby.collect (#15808) @galipremsagar
Raise FileNotFoundError when a literal JSON string that looks like a json filename is passed (#15806) @lithomas1
Deprecate divisions='quantile' support in set_index (#15804) @rjzamora
Improve performance of Series.to_numpy/to_cupy (#15792) @mroeschke
Access self.index instead of self._index where possible (#15781) @mroeschke
Support filtered I/O in chunked_parquet_reader and simplify the use of parquet_reader_options (#15764) @mhaseeb123
Avoid index-to-column conversion in some DataFrame ops (#15763) @mroeschke
Fix chunked_parquet_reader behavior when input has no more rows to read (#15757) @mhaseeb123
[JNI] Expose java API for cudf::io::config_host_memory_resource (#15745) @abellina
Migrate all cpp pxd files into pylibcudf (#15740) @vyasr
Validate and materialize iterators earlier in as_column (#15739) @mroeschke
Push some as_column arrow logic to ColumnBase.from_arrow (#15738) @mroeschke
Expose stream parameter in public reduction APIs (#15737) @srinivasyadav18
remove unnecessary 'setuptools' host dependency, simplify dependencies.yaml (#15736) @jameslamb
Defer to C++ equality and hashing for pylibcudf DataType and Aggregation objects (#15732) @wence-
Implement null-aware NOT_EQUALS binop (#15731) @wence-
Fix split-record result list column offset type (#15707) @davidwendt
Upgrade arrow to 16 (#15703) @galipremsagar
Remove experimental namespace from make_strings_children (#15702) @davidwendt
Rework get_json_object benchmark to use nvbench (#15698) @davidwendt
Rework some python tests of Parquet delta encodings (#15693) @etseidl
Skeleton cudf polars package (#15688) @wence-
Upgrade pre commit hooks (#15685) @wence-
Allow fillna to validate for CategoricalColumn.fillna (#15683) @galipremsagar
Misc Column cleanups (#15682) @mroeschke
Reducing runtime of JSON reader options benchmark (#15681) @shrshi
Add Timestamp and Timedelta proxy types (#15680) @galipremsagar
Remove host_parse_nested_json. (#15674) @bdice
Reduce runtime for ParquetChunkedReaderInputLimitTest gtests (#15672) @davidwendt
Add large-strings gtest for cudf::interleave_columns (#15669) @davidwendt
Use experimental make_strings_children for multi-replace_re (#15667) @davidwendt
Enabled Holiday types in cudf.pandas (#15664) @galipremsagar
Remove obsolete XFAIL markers for query-planning (#15662) @rjzamora
Clean up join benchmarks (#15644) @PointKernel
Enable warnings as errors in custreamz (#15642) @mroeschke
Improve distinct join with set retrieve (#15636) @PointKernel
Fix -Werror=type-limits. (#15635) @bdice
Enable FutureWarnings/DeprecationWarnings as errors for dask_cudf (#15634) @mroeschke
Remove NVBench SHA override. (#15633) @alliepiper
Add support for large string columns to Parquet reader and writer (#15632) @etseidl
Large strings support in MD5 and SHA hashers (#15631) @davidwendt
Fix make_offsets_child_column usage in cudf::strings::detail::shift (#15630) @davidwendt
Use experimental make_strings_children for strings convert (#15629) @davidwendt
Forward-merge branch-24.04 to branch-24.06 (#15627) @bdice
Avoid accessing attributes via _column if not needed (#15624) @mroeschke
Make ColumnBase.__cuda_array_interface__ opt out instead of opt in (#15622) @mroeschke
Large strings support for cudf::gather (#15621) @davidwendt
Remove jni-docker-build workflow (#15619) @bdice
Support DurationType in cudf parquet reader via arrow:schema (#15617) @mhaseeb123
Drop Centos7 support (#15608) @NvTimLiu
Use experimental make_strings_children for json/csv writers (#15599) @davidwendt
Use experimental make_strings_children for strings join/url_encode/slice (#15598) @davidwendt
Use experimental make_strings_children in nvtext APIs (#15595) @davidwendt
Migrate to {{ stdlib("c") }} (#15594) @hcho3
Deprecate to/from_dask_dataframe APIs in dask-cudf (#15592) @rjzamora
Minor fixups for future NumPy 2 compatibility (#15590) @seberg
Delay materializing RangeIndex in .reset_index (#15588) @mroeschke
Use experimental make_strings_children for capitalize/case/pad functions (#15587) @davidwendt
Use experimental make_strings_children for strings replace/filter/translate (#15586) @davidwendt
Add multithreaded parquet reader benchmarks. (#15585) @nvdbaranec
Don't materialize column during RangeIndex methods (#15582) @mroeschke
Improve performance for cudf::strings::count_re (#15578) @davidwendt
Replace RangeIndex._start/_stop/_step with _range (#15576) @mroeschke
add --rm and --name to devcontainer run args (#15572) @trxcllnt
Change the default dictionary policy in Parquet writer from ALWAYS to ADAPTIVE (#15570) @mhaseeb123
Rename experimental JSON tests. (#15568) @bdice
Refactor JNI native dependency loading to allow returning of library path (#15566) @jlowe
Remove protobuf and use parsed ORC statistics from libcudf (#15564) @bdice
Deprecate legacy JSON reader options. (#15558) @bdice
Use same .clang-format in cuDF JNI (#15557) @bdice
Large strings support for cudf::fill (#15555) @davidwendt
Upgrade upper bound pinning to pandas-2.2.2 (#15554) @galipremsagar
Work around issues with cccl main (#15552) @miscco
Enable pandas plotting unit tests for cudf.pandas (#15547) @mroeschke
Move timezone conversion logic to DatetimeColumn (#15545) @mroeschke
Large strings support for cudf::interleave_columns (#15544) @davidwendt
[skip ci] Switch back to 24.06 branch for pandas tests (#15543) @galipremsagar
Remove checks dependency from static-configure test job. (#15542) @bdice
Remove legacy JSON reader from Python (#15538) @bdice
Enable more ignored pandas unit tests for cudf.pandas (#15535) @mroeschke
Large strings support for cudf::clamp (#15533) @davidwendt
Remove version hard-coding (#15529) @galipremsagar
Removing all batching code from parquet writer (#15528) @mhaseeb123
Make some private class properties not settable (#15527) @mroeschke
Large strings support in regex replace APIs (#15524) @davidwendt
Skip pandas unit tests that crash pytest workers in cudf.pandas (#15521) @mroeschke
Preserve column metadata during more DataFrame operations (#15519) @mroeschke
Move to pandas-tests to a dedicated workflow file and trigger it from branch.yaml (#15516) @galipremsagar
Large strings gtest fixture and utilities (#15513) @davidwendt
Convert libcudf resource parameters to rmm::device_async_resource_ref (#15507) @harrism
Relax protobuf lower bound to 3.20. (#15506) @bdice
Clean up index methods (#15496) @mroeschke
Update strings contains benchmarks to nvbench (#15495) @davidwendt
Update NVBench fixture to use new hooks, fix pinned memory segfault. (#15492) @alliepiper
Enable tests/scalar and test/series in cudf.pandas tests (#15486) @mroeschke
Clean up __cuda_array_interface__ handling in as_column (#15477) @mroeschke
Avoid .ordered and .categories from being settable in CategoricalColumn and CategoricalDtype (#15475) @mroeschke
Ignore pandas tests for cudf.pandas that need motoserver (#15468) @mroeschke
Use cached_property for NumericColumn.nan_count instead of ._nan_count variable (#15466) @mroeschke
Add to_arrow_device() functions that accept views (#15465) @davidwendt
Add custom status check workflow (#15464) @galipremsagar
Disable pandas 2.x clipboard tests in cudf.pandas tests (#15462) @mroeschke
Enable tests/strings/test_api.py and tests/io/pytables in cudf.pandas tests (#15461) @mroeschke
Enable test_parsing in cudf.pandas tests (#15460) @mroeschke
Add from_arrow_device function to cudf interop using nanoarrow (#15458) @zeroshade
Remove deprecated strings offsets_begin (#15454) @davidwendt
Enable tests/windows/ in cudf.pandas tests (#15444) @mroeschke
Enable tests/interchange/test_impl.py in cudf.pandas tests (#15443) @mroeschke
Enable tests/io/test_user_agent.py in cudf pandas tests (#15442) @mroeschke
Performance improvement in libcudf case conversion for long strings (#15441) @davidwendt
Remove prior test skipping in run-pandas-tests with testing 2.2.1 (#15440) @mroeschke
Support orc and text IO with dask-expr using legacy conversion (#15439) @rjzamora
Floating <--> fixed-point conversion must now be called explicitly (#15438) @pmattione-nvidia
Unify Copy-On-Write and Spilling (#15436) @madsbk
Enable dask_cudf json and s3 tests with query-planning on (#15408) @rjzamora
Bump ruff and codespell pre-commit checks (#15407) @mroeschke
Enable all tests for arm arch (#15402) @galipremsagar
Bind read_parquet_metadata API to libcudf instead of pyarrow and extract RowGroup information (#15398) @mhaseeb123
Optimizing multi-source byte range reading in JSON reader (#15396) @shrshi
add correct labels to pandas_function_request.md (#15381) @raybellwaves
Remove deprecated hash() and spark_murmurhash3_x86_32() (#15375) @davidwendt
Large strings support in cudf::merge (#15374) @davidwendt
Enable test-reporting for pandas pytests in CI (#15369) @galipremsagar
Use logical types in Parquet reader (#15365) @etseidl
Add experimental make_strings_children utility (#15363) @davidwendt
Forward-merge branch-24.04 to branch-24.06 (#15349) @bdice
Fix CMake files in libcudf C++ examples to use existing libcudf build if present (#15348) @mhaseeb123
Use ruff pydocstyle over pydocstyle pre-commit hook (#15345) @mroeschke
Refactor stream mode setup for gtests (#15337) @davidwendt
Benchmark decimal <--> floating conversions. (#15334) @pmattione-nvidia
Avoid duplicate dask-cudf testing (#15333) @rjzamora
Skip decode steps in Parquet reader when nullable columns have no nulls (#15332) @etseidl
Update udf_cpp to use rapids_cpm_cccl. (#15331) @bdice
Forward-merge branch-24.04 into branch-24.06 skip ci @rapids-bot[bot]
Allow numeric_only=True for simple groupby reductions (#15326) @rjzamora
Drop CentOS 7 support. (#15323) @bdice
Rework cudf::find_and_replace_all to use gather-based make_strings_column (#15305) @davidwendt
First pass at adding testing for pylibcudf (#15300) @vyasr
[FEA] Performance improvement for mixed left semi/anti join (#15288) @tgujar
Rework cudf::replace_nulls to use strings::detail::copy_if_else (#15286) @davidwendt
Clean up special casing in as_column for non-typed input (#15276) @mroeschke
Large strings support in cudf::concatenate (#15195) @davidwendt
Use less _is_categorical_dtype (#15148) @mroeschke
Align date_range defaults with pandas, support tz (#15139) @mroeschke
ModuleAccelerator performance: cache the result of checking if a caller is in the denylist (#15056) @shwina
Use offsetalator in cudf::strings::replace functions (#14824) @davidwendt
Cleanup some timedelta/datetime column logic (#14715) @mroeschke
Refactor numpy array input in as_column (#14651) @mroeschke
Refactor joins for conditional semis and antis (#14646) @DanialJavady96
Eagerly populate the class dict for cudf.pandas proxy types (#14534) @shwina
Some additional kernel thread index refactoring. (#14107) @bdice

- C++
Published by raydouglass almost 2 years ago

https://github.com/rapidsai/cudf - v24.04.01

🚨 Breaking Changes

Restructure pylibcudf/arrow interop facilities (#15325) @vyasr
Change exceptions thrown by copying APIs (#15319) @vyasr
Change strings_column_view::char_size to return int64 (#15197) @davidwendt
Upgrade to arrow-14.0.2 (#15108) @galipremsagar
Add support for pandas-2.2 in cudf (#15100) @galipremsagar
Deprecate cudf::hashing::spark_murmurhash3_x86_32 (#15074) @davidwendt
Align MultiIndex.get_indexer with pandas 2.2 change (#15059) @mroeschke
Raise an error on import for unsupported GPUs. (#15053) @bdice
Deprecate datelike isin casting strings to dates to match pandas 2.2 (#15046) @mroeschke
Align concat Series name behavior in pandas 2.2 (#15032) @mroeschke
Add future_stack to DataFrame.stack (#15015) @galipremsagar
Deprecate groupby fillna (#15000) @mroeschke
Deprecate replace with categorical columns (#14988) @mroeschke
Deprecate delim_whitespace in read_csv for pandas 2.2 (#14986) @mroeschke
Deprecate parameters similar to pandas 2.2 (#14984) @mroeschke
Add missing atomic operators, refactor atomic operators, move atomic operators to detail namespace. (#14962) @bdice
Add pandas-2.x support in cudf (#14916) @galipremsagar
Use cuco::static_set in the hash-based groupby (#14813) @PointKernel

🐛 Bug Fixes

Fix an issue with creating a series from scalar when dtype='category' (#15476) @galipremsagar
Update pre-commit-hooks to v0.0.3 (#15355) @KyleFromNVIDIA
[BUG][JNI] Trigger MemoryBuffer.onClosed after memory is freed (#15351) @abellina
Fix an issue with multiple short list rowgroups using the Parquet chunked reader. (#15342) @nvdbaranec
Avoid importing dask-expr if "query-planning" config is False (#15340) @rjzamora
Fix gtests/ERROR_TEST errors when run in Debug (#15317) @davidwendt
Fix OOB read in inflate_kernel (#15309) @vuule
Work around a cuFile error when running CSV tests with memcheck (#15293) @vuule
Fix Doxygen upload directory (#15291) @KyleFromNVIDIA
Fix Doxygen check (#15289) @KyleFromNVIDIA
Reintroduce PANDAS_GE_220 import (#15287) @wence-
Fix mean computation for the geometric distribution in the data generator (#15282) @vuule
Fix Parquet decimal64 stats (#15281) @etseidl
Make linking of nvtx3-cpp BUILD_LOCAL_INTERFACE (#15271) @KyleFromNVIDIA
Workaround compute-sanitizer memcheck bug (#15259) @davidwendt
Cleanup hostdevice_vector and add more APIs (#15252) @ttnghia
Fix number of rows in randomly generated lists columns (#15248) @vuule
Fix wrong output for collect_list/collect_set of lists column (#15243) @ttnghia
Fix testChunkedPackTwoPasses to copy from the bounce buffer (#15220) @abellina
Fix accessing .columns by an external API (#15212) @galipremsagar
[JNI] Disable testChunkedPackTwoPasses for now (#15210) @abellina
Update labeler and codeowner configs for CMake files (#15208) @PointKernel
Avoid dict normalization in __dask_tokenize__ (#15187) @rjzamora
Fix memcheck error in distinct inner join (#15164) @PointKernel
Remove unneeded script parameters in test_cpp_memcheck.sh (#15158) @davidwendt
Fix ListColumn.to_pandas() to retain list type (#15155) @galipremsagar
Avoid factorization in MultiIndex.to_pandas (#15150) @mroeschke
Fix GroupBy.get_group and GroupBy.indices (#15143) @wence-
Remove const from range_window_bounds::_extent. (#15138) @mythrocks
DataFrame.columns = ... retains RangeIndex & set dtype (#15129) @mroeschke
Correctly handle output for GroupBy.apply when chunk results are reindexed series (#15109) @brandon-b-miller
Fix Series.groupby.shift with a MultiIndex (#15098) @mroeschke
Fix reductions when DataFrame has MultiIndex columns (#15097) @mroeschke
Fix deprecation warnings for deprecated hash() calls (#15095) @davidwendt
Add support for arrow large_string in cudf (#15093) @galipremsagar
Fix sort_values pytest failure with pandas-2.x regression (#15092) @galipremsagar
Resolve path parsing issues in get_json_object (#15082) @SurajAralihalli
Fix bugs in handling of delta encodings (#15075) @etseidl
Fix is_device_write_preferred in void_sink and user_sink_wrapper (#15064) @vuule
Eliminate duplicate allocation of nested string columns (#15061) @vuule
Raise an error on import for unsupported GPUs. (#15053) @bdice
Align concat Series name behavior in pandas 2.2 (#15032) @mroeschke
Fix Index.difference to handle duplicate values when one of the inputs is empty (#15016) @galipremsagar
Add future_stack to DataFrame.stack (#15015) @galipremsagar
Fix handling of values=None in pylibcudf GroupBy.get_groups (#14998) @shwina
Fix DataFrame.sort_index to respect ignore_index on all axis (#14995) @galipremsagar
Raise for pyarrow array that is tz-aware (#14980) @mroeschke
Direct SeriesGroupBy.aggregate to SeriesGroupBy.agg (#14971) @rjzamora
Respect IntervalDtype and CategoricalDtype objects passed by users (#14961) @mroeschke
unset CUDF_SPILL after a pytest (#14958) @galipremsagar
Fix Null literals to be not parsed as string when mixed types as string is enabled in JSON reader (#14939) @karthikeyann
Fix chunked reads of Parquet delta encoded pages (#14921) @etseidl
Fix reading offset for data stream in ORC reader (#14911) @ttnghia
Enable sanitizer check for a test case testORCReadAndWriteForDecimal128 (#14897) @res-life
Fix dask token normalization (#14829) @rjzamora
Fix 24.04 versions (#14825) @raydouglass
Ensure slow private attrs are maybe proxies (#14380) @mroeschke

📖 Documentation

Ignore DLManagedTensor in the docs build (#15392) @davidwendt
Revert "Temporarily disable docs errors. (#15265)" (#15269) @bdice
Temporarily disable docs errors. (#15265) @bdice
Update developer_guide.md with new guidance on quoted internal includes (#15238) @harrism
Fix broken link for developer guide (#15025) @sanjana098
[DOC] Update typo in docs example of structs_column_wrapper (#14949) @karthikeyann
Update cudf.pandas FAQ. (#14940) @bdice
Optimize doc builds (#14856) @vyasr
Add developer guideline to use east const. (#14836) @bdice
Document how cuDF is pronounced (#14753) @pentschev
Notes convert to Pandas-compat (#12641) @Touutae-lab

🚀 New Features

Address inconsistency in single quote normalization in JSON reader (#15324) @shrshi
Use JNI pinned pool resource with cuIO (#15255) @abellina
Add DELTA_BYTE_ARRAY encoder for Parquet (#15239) @etseidl
Migrate filling operations to pylibcudf (#15225) @brandon-b-miller
[JNI] rmm based pinned pool (#15219) @abellina
Implement zero-copy host buffer source instead of using an arrow implementation (#15189) @vuule
Enable creation of columns from scalar (#15181) @vyasr
Use NVTX from GitHub. (#15178) @bdice
Implement segmented_row_bit_count for computing row sizes by segments of rows (#15169) @ttnghia
Implement search using pylibcudf (#15166) @vyasr
Add distinct left join (#15149) @PointKernel
Add cardinality control for groupby benchs with flat types (#15134) @PointKernel
Add ability to request Parquet encodings on a per-column basis (#15081) @etseidl
Automate include grouping order in .clang-format (#15063) @harrism
Requesting a clean build directory also clears Jitify cache (#15052) @robertmaynard
API for JSON unquoted whitespace normalization (#15033) @shrshi
Implement concatenate, lists.explode, merge, sorting, and stream compaction in pylibcudf (#15011) @vyasr
Implement replace in pylibcudf (#15005) @vyasr
Add distinct key inner join (#14990) @PointKernel
Implement rolling in pylibcudf (#14982) @vyasr
Implement joins in pylibcudf (#14972) @vyasr
Implement scans and reductions in pylibcudf (#14970) @vyasr
Rewrite cudf internals using pylibcudf groupby (#14946) @vyasr
Implement groupby in pylibcudf (#14945) @vyasr
Support casting of Map type to string in JSON reader (#14936) @karthikeyann
POC for whitespace removal in input JSON data using FST (#14931) @shrshi
Support for LZ4 compression in ORC and Parquet (#14906) @vuule
Remove supports_streams from cuDF custom memory resources. (#14857) @harrism
Migrate unary operations to pylibcudf (#14850) @vyasr
Migrate binary operations to pylibcudf (#14821) @vyasr
Add row index and stripe size options to Python ORC chunked writer (#14785) @vuule
Support CUDA 12.2 (#14712) @jameslamb

🛠️ Improvements

Backport: Relax protobuf lower bound to 3.20. (#15506) (#15610) @bdice
Use conda env create --yes instead of --force (#15403) @bdice
Restructure pylibcudf/arrow interop facilities (#15325) @vyasr
Change exceptions thrown by copying APIs (#15319) @vyasr
Enable branch testing for cudf.pandas (#15316) @galipremsagar
Replace black with ruff-format (#15312) @mroeschke
This fixes an NPE when trying to read empty JSON data by adding a new API for missing information (#15307) @revans2
Address poor performance of Parquet string decoding (#15304) @etseidl
Update script input name (#15301) @AyodeAwe
Make test_read_parquet_partitioned_filtered data deterministic (#15296) @mroeschke
Add timeout for cudf.pandas pandas tests (#15284) @galipremsagar
Add upper bound to prevent usage of NumPy 2 (#15283) @bdice
Fix cudf::test::to_host return of host_vector (#15263) @davidwendt
Implement grouped product scan (#15254) @wence-
Add CUDA 12.4 to supported PTX versions (#15247) @brandon-b-miller
Implement DataFrame|Series.squeeze (#15244) @mroeschke
Roll back ipow changes due to register pressure. (#15242) @pmattione-nvidia
Remove create_chars_child_column utility (#15241) @davidwendt
Update dlpack to version 0.8 (#15237) @dantegd
Improve performance in JSON reader when mixed_types_as_string option is enabled (#15236) @shrshi
Remove row conversion code from libcudf (#15234) @ttnghia
Use variable substitution for RAPIDS version in Doxyfile (#15231) @KyleFromNVIDIA
Add ListColumns.to_pandas(arrow_type=) (#15228) @mroeschke
Treat dask-cudf CI artifacts as pure wheels (#15223) @bdice
Clean up usage of __CUDA_ARCH__ and other macros. (#15218) @bdice
DOC: use constants in performance-comparisons.ipynb (#15215) @raybellwaves
Rewrite conversion in terms of column (#15213) @vyasr
Switch pytest-xdist algo to worksteal (#15207) @galipremsagar
Deprecate strings_column_view::offsets_begin() (#15205) @davidwendt
Add get_upstream_resource method to stream_checking_resource_adaptor (#15203) @miscco
Tune up row size estimation in the data generator (#15202) @vuule
Fix offset value for generating test data in parquet_chunked_reader_test.cu (#15200) @ttnghia
Change strings_column_view::char_size to return int64 (#15197) @davidwendt
Fix includes for row_operators.cuh (#15194) @davidwendt
Generalize GHA selectors for pure Python testing (#15191) @bdice
Improvements for __cuda_array_interface__ tests (#15188) @bdice
Allow to_pandas to return pandas.ArrowDtype (#15182) @mroeschke
Ignore byte_range in read_json when the size is not smaller than the input data (#15180) @vuule
Expose new stable_sort and finish stream_compaction in pylibcudf (#15175) @wence-
[ci] update matrix filters for dask-cudf builds (#15174) @jameslamb
Change make_strings_children to return uvector (#15171) @davidwendt
Don't override to_pandas for Datelike columns (#15167) @mroeschke
Drop python-snappy from dependencies. (#15161) @bdice
Add microkernels for fixed-width and fixed-width dictionary in Parquet decode (#15159) @abellina
Make HostColumnVector.DataType accessor methods public (#15157) @jbrennan333
Java bindings for left outer distinct join (#15154) @jlowe
Forward-merge branch-24.02 to branch-24.04 (#15153) @bdice
Enable pandas pytests for cudf.pandas (#15147) @galipremsagar
Add java option to keep quotes for JSON reads (#15146) @revans2
Change cross-pandas-version testing in cudf (#15145) @galipremsagar
Use hostdevice_vector in kernel_error to avoid the pageable copy (#15140) @vuule
Clean up Columns.astype & cudf.dtype (#15125) @mroeschke
Simplify some to_pandas implementations (#15123) @mroeschke
Java: Add leak tracking for Scalar instances (#15121) @jlowe
Remove calls to strings_column_view::offsets_begin() (#15112) @davidwendt
Add support for Python 3.11, require NumPy 1.23+ (#15111) @jameslamb
Compile-time ipow computation with array lookup (#15110) @pmattione-nvidia
Upgrade to arrow-14.0.2 (#15108) @galipremsagar
Dynamically set version in RAPIDS doc builds (#15101) @jakirkham
Add support for pandas-2.2 in cudf (#15100) @galipremsagar
Update devcontainers to CUDA Toolkit 12.2 (#15099) @trxcllnt
Fix datetime binop pytest failures in pandas-2.2 (#15090) @galipremsagar
Validate types in pylibcudf Column/Table constructors (#15088) @wence-
xfail test_join_ordering_pandas_compat for pandas 2.2 (#15080) @mroeschke
Add general purpose host memory allocator reference to cuIO with a demo of pooled-pinned allocation. (#15079) @nvdbaranec
Adjust test_binops for pandas 2.2 (#15078) @mroeschke
Remove offsets_begin() call from nvtext::generate_ngrams (#15077) @davidwendt
Use offsetalator in cudf::detail::has_nonempty_null_rows (#15076) @davidwendt
Deprecate cudf::hashing::spark_murmurhash3_x86_32 (#15074) @davidwendt
Fix cudf::test::to_host to handle both offset types for strings columns (#15073) @davidwendt
Add condition for test_groupby_nulls_basic in pandas 2.2 (#15072) @mroeschke
xfail tests in test_udf_masked_ops due to pandas 2.2 bug (#15071) @mroeschke
target branch-24.04 for GitHub Actions workflows (#15069) @jameslamb
Implement stable version of cudf::sort (#15066) @wence-
Fix ORC and JSON tests failures for pandas 2.2 (#15062) @mroeschke
Adjust test_joining for pandas 2.2 (#15060) @mroeschke
Align MultiIndex.get_indexer with pandas 2.2 change (#15059) @mroeschke
Fix test_resample index dtype checking for pandas 2.2 (#15058) @mroeschke
Split out strings/replace.cu and rework its gtests (#15054) @davidwendt
Avoid incompatible value type setting in test_rolling for pandas 2.2 (#15050) @mroeschke
Change chained replace inplace test to COW test for pandas 2.2 (#15049) @mroeschke
Deprecate datelike isin casting strings to dates to match pandas 2.2 (#15046) @mroeschke
Avoid chained indexing in test_indexing for pandas 2.2 (#15045) @mroeschke
Avoid pandas 2.2 DeprecationWarning in test_hdf (#15044) @mroeschke
Use appropriate make_offsets_child_column for building lists columns (#15043) @davidwendt
Factor out position-offsets logic from strings split_helper utility (#15040) @davidwendt
Forward-merge branch-24.02 to branch-24.04 (#15039) @bdice
Clean up nvtx macros (#15038) @PointKernel
Add xfailures for test_applymap for pandas 2.2 (#15034) @mroeschke
Expose libcudf filter expression in read_parquet (#15028) @wence-
Adjust tests in test_dataframe.py for pandas 2.2 (#15023) @mroeschke
Adjust test_datetime_infer_format for pandas 2.2 (#15021) @mroeschke
Performance optimizations for parquet sub-rowgroup reader. (#15020) @nvdbaranec
JNI bindings for distinct_hash_join (#15019) @jlowe
Change copy_if_safe to call thrust instead of the overload function (#15018) @davidwendt
Improve performance of copy_if_else for long strings (#15017) @davidwendt
Fix is_string_dtype test for pandas 2.2 (#15012) @mroeschke
Rework cudf::strings::detail::copy_range for offsetalator (#15010) @davidwendt
Use offsetalator in cudf::get_json_object() (#15009) @davidwendt
Align integral types in ORC to specs (#15008) @vuule
Clean up detail sequence header inclusion (#15007) @PointKernel
Add groupby.apply(include_groups=) to match pandas 2.2 deprecation (#15006) @mroeschke
Use offsetalator in cudf::interleave_columns() (#15004) @davidwendt
Use offsetalator in cudf::row_bit_count() (#15003) @davidwendt
Use offsetalator in cudf::strings::wrap() (#15002) @davidwendt
Use offsetalator in cudf::strings::reverse (#15001) @davidwendt
Deprecate groupby fillna (#15000) @mroeschke
Ensure to_* IO methods respect pandas 2.2 keyword only deprecation (#14999) @mroeschke
Remove unneeded calls to create_chars_child_column utility (#14997) @davidwendt
Add environment-agnostic scripts for running ctests and pytests (#14992) @trxcllnt
Filter all DeprecationWarning's by ArrowTable.to_pandas() (#14989) @galipremsagar
Deprecate replace with categorical columns (#14988) @mroeschke
Deprecate delim_whitespace in read_csv for pandas 2.2 (#14986) @mroeschke
Deprecate parameters similar to pandas 2.2 (#14984) @mroeschke
Ensure that ctest is called with --no-tests=error. (#14983) @bdice
Deprecate non-integer periods in date_range and interval_range (#14976) @galipremsagar
Update ops-bot.yaml (#14974) @AyodeAwe
Use page statistics in Parquet reader (#14973) @etseidl
Use fused types for overloaded function signatures (#14969) @vyasr
Deprecate certain frequency strings (#14967) @galipremsagar
Update copyrights for 24.04. (#14964) @bdice
Add missing atomic operators, refactor atomic operators, move atomic operators to detail namespace. (#14962) @bdice
Introduce GetJsonObjectOptions in getJSONObject Java API (#14956) @SurajAralihalli
JNI JSON read with DataSource and inferred schema, along with basic java nested Schema JSON reads (#14954) @revans2
Make codecov only informational (always pass). (#14952) @bdice
Replace legacy cudf and dask_cudf imports as (d)gd (#14944) @mroeschke
Replace _is_datetime64tz/interval_dtype with isinstance (#14943) @mroeschke
Update tests for pandas 2. (#14941) @bdice
Use more public pandas APIs (#14929) @mroeschke
Replace local copyright check with pre-commit-hooks verify-copyright (#14917) @KyleFromNVIDIA
Add pandas-2.x support in cudf (#14916) @galipremsagar
Use offsetalator in nvtext::byte_pair_encoding (#14888) @davidwendt
De-DOS line-endings (#14880) @wence-
Add detail cuco_allocator (#14877) @PointKernel
Move all core types to using enum class in Cython (#14876) @vyasr
Read cudf.__version__ in Sphinx build (#14872) @KyleFromNVIDIA
Use int64 offset types for accessing code-points in nvtext::normalize (#14868) @davidwendt
Read version from VERSION file in CMake (#14867) @KyleFromNVIDIA
Update conda-cpp-post-build-checks to branch-24.04. (#14854) @bdice
Update cudf for compatibility with the latest cuco (#14849) @PointKernel
Remove deprecated strings functions (#14848) @davidwendt
Fix CI workflows for pandas-tests and add test summary. (#14847) @bdice
Use offsetalator in cudf::strings::copy_slice (#14844) @davidwendt
Fix V2 Parquet page alignment for use with zStandard compression (#14841) @etseidl
Fix calls to deprecated strings factory API in examples. (#14838) @bdice
Update pre-commit hooks (#14837) @bdice
Use rapids_cuda_set_runtime to determine cuda runtime usage by target (#14833) @vyasr
Remove get_mem_info functions from custom memory resources (#14832) @harrism
Fix debug build by splitting row_operator_tests_utilities.cu (#14826) @davidwendt
Remove -DNVBench_ENABLE_CUPTI=OFF. (#14820) @bdice
Use cuco::static_set in the hash-based groupby (#14813) @PointKernel
Branch 24.04 merge branch 24.02 (#14809) @vyasr
Branch 24.04 merge branch 24.02 (#14806) @vyasr
Introduce basic "cudf" backend for Dask Expressions (#14805) @rjzamora
Remove build_struct|list_column (#14786) @mroeschke
Use offsetalator in nvtext tokenize functions (#14783) @davidwendt
Reduce execution time of Python ORC tests (#14776) @vuule
Use offsetalator in cudf::strings::split functions (#14757) @davidwendt
Use offsetalator in cudf::strings::findall (#14745) @davidwendt
Use offsetalator in cudf::strings::url_decode (#14744) @davidwendt
Use get_offset_value utility in strings shift function (#14743) @davidwendt
Use as_column instead of full (#14698) @mroeschke
List all notable breaking changes (#13535) @galipremsagar

- C++
Published by raydouglass about 2 years ago

https://github.com/rapidsai/cudf - v24.04.00

🚨 Breaking Changes

Restructure pylibcudf/arrow interop facilities (#15325) @vyasr
Change exceptions thrown by copying APIs (#15319) @vyasr
Change strings_column_view::char_size to return int64 (#15197) @davidwendt
Upgrade to arrow-14.0.2 (#15108) @galipremsagar
Add support for pandas-2.2 in cudf (#15100) @galipremsagar
Deprecate cudf::hashing::spark_murmurhash3_x86_32 (#15074) @davidwendt
Align MultiIndex.get_indexer with pandas 2.2 change (#15059) @mroeschke
Raise an error on import for unsupported GPUs. (#15053) @bdice
Deprecate datelike isin casting strings to dates to match pandas 2.2 (#15046) @mroeschke
Align concat Series name behavior in pandas 2.2 (#15032) @mroeschke
Add future_stack to DataFrame.stack (#15015) @galipremsagar
Deprecate groupby fillna (#15000) @mroeschke
Deprecate replace with categorical columns (#14988) @mroeschke
Deprecate delim_whitespace in read_csv for pandas 2.2 (#14986) @mroeschke
Deprecate parameters similar to pandas 2.2 (#14984) @mroeschke
Add missing atomic operators, refactor atomic operators, move atomic operators to detail namespace. (#14962) @bdice
Add pandas-2.x support in cudf (#14916) @galipremsagar
Use cuco::static_set in the hash-based groupby (#14813) @PointKernel

🐛 Bug Fixes

Fix an issue with creating a series from scalar when dtype='category' (#15476) @galipremsagar
Update pre-commit-hooks to v0.0.3 (#15355) @KyleFromNVIDIA
[BUG][JNI] Trigger MemoryBuffer.onClosed after memory is freed (#15351) @abellina
Fix an issue with multiple short list rowgroups using the Parquet chunked reader. (#15342) @nvdbaranec
Avoid importing dask-expr if "query-planning" config is False (#15340) @rjzamora
Fix gtests/ERROR_TEST errors when run in Debug (#15317) @davidwendt
Fix OOB read in inflate_kernel (#15309) @vuule
Work around a cuFile error when running CSV tests with memcheck (#15293) @vuule
Fix Doxygen upload directory (#15291) @KyleFromNVIDIA
Fix Doxygen check (#15289) @KyleFromNVIDIA
Reintroduce PANDAS_GE_220 import (#15287) @wence-
Fix mean computation for the geometric distribution in the data generator (#15282) @vuule
Fix Parquet decimal64 stats (#15281) @etseidl
Make linking of nvtx3-cpp BUILD_LOCAL_INTERFACE (#15271) @KyleFromNVIDIA
Workaround compute-sanitizer memcheck bug (#15259) @davidwendt
Cleanup hostdevice_vector and add more APIs (#15252) @ttnghia
Fix number of rows in randomly generated lists columns (#15248) @vuule
Fix wrong output for collect_list/collect_set of lists column (#15243) @ttnghia
Fix testChunkedPackTwoPasses to copy from the bounce buffer (#15220) @abellina
Fix accessing .columns by an external API (#15212) @galipremsagar
[JNI] Disable testChunkedPackTwoPasses for now (#15210) @abellina
Update labeler and codeowner configs for CMake files (#15208) @PointKernel
Avoid dict normalization in __dask_tokenize__ (#15187) @rjzamora
Fix memcheck error in distinct inner join (#15164) @PointKernel
Remove unneeded script parameters in test_cpp_memcheck.sh (#15158) @davidwendt
Fix ListColumn.to_pandas() to retain list type (#15155) @galipremsagar
Avoid factorization in MultiIndex.to_pandas (#15150) @mroeschke
Fix GroupBy.get_group and GroupBy.indices (#15143) @wence-
Remove const from range_window_bounds::_extent. (#15138) @mythrocks
DataFrame.columns = ... retains RangeIndex & set dtype (#15129) @mroeschke
Correctly handle output for GroupBy.apply when chunk results are reindexed series (#15109) @brandon-b-miller
Fix Series.groupby.shift with a MultiIndex (#15098) @mroeschke
Fix reductions when DataFrame has MultiIndex columns (#15097) @mroeschke
Fix deprecation warnings for deprecated hash() calls (#15095) @davidwendt
Add support for arrow large_string in cudf (#15093) @galipremsagar
Fix sort_values pytest failure with pandas-2.x regression (#15092) @galipremsagar
Resolve path parsing issues in get_json_object (#15082) @SurajAralihalli
Fix bugs in handling of delta encodings (#15075) @etseidl
Fix is_device_write_preferred in void_sink and user_sink_wrapper (#15064) @vuule
Eliminate duplicate allocation of nested string columns (#15061) @vuule
Raise an error on import for unsupported GPUs. (#15053) @bdice
Align concat Series name behavior in pandas 2.2 (#15032) @mroeschke
Fix Index.difference to handle duplicate values when one of the inputs is empty (#15016) @galipremsagar
Add future_stack to DataFrame.stack (#15015) @galipremsagar
Fix handling of values=None in pylibcudf GroupBy.get_groups (#14998) @shwina
Fix DataFrame.sort_index to respect ignore_index on all axis (#14995) @galipremsagar
Raise for pyarrow array that is tz-aware (#14980) @mroeschke
Direct SeriesGroupBy.aggregate to SeriesGroupBy.agg (#14971) @rjzamora
Respect IntervalDtype and CategoricalDtype objects passed by users (#14961) @mroeschke
unset CUDF_SPILL after a pytest (#14958) @galipremsagar
Fix Null literals to be not parsed as string when mixed types as string is enabled in JSON reader (#14939) @karthikeyann
Fix chunked reads of Parquet delta encoded pages (#14921) @etseidl
Fix reading offset for data stream in ORC reader (#14911) @ttnghia
Enable sanitizer check for a test case testORCReadAndWriteForDecimal128 (#14897) @res-life
Fix dask token normalization (#14829) @rjzamora
Fix 24.04 versions (#14825) @raydouglass
Ensure slow private attrs are maybe proxies (#14380) @mroeschke

📖 Documentation

Ignore DLManagedTensor in the docs build (#15392) @davidwendt
Revert "Temporarily disable docs errors. (#15265)" (#15269) @bdice
Temporarily disable docs errors. (#15265) @bdice
Update developer_guide.md with new guidance on quoted internal includes (#15238) @harrism
Fix broken link for developer guide (#15025) @sanjana098
[DOC] Update typo in docs example of structs_column_wrapper (#14949) @karthikeyann
Update cudf.pandas FAQ. (#14940) @bdice
Optimize doc builds (#14856) @vyasr
Add developer guideline to use east const. (#14836) @bdice
Document how cuDF is pronounced (#14753) @pentschev
Notes convert to Pandas-compat (#12641) @Touutae-lab

🚀 New Features

Address inconsistency in single quote normalization in JSON reader (#15324) @shrshi
Use JNI pinned pool resource with cuIO (#15255) @abellina
Add DELTA_BYTE_ARRAY encoder for Parquet (#15239) @etseidl
Migrate filling operations to pylibcudf (#15225) @brandon-b-miller
[JNI] rmm based pinned pool (#15219) @abellina
Implement zero-copy host buffer source instead of using an arrow implementation (#15189) @vuule
Enable creation of columns from scalar (#15181) @vyasr
Use NVTX from GitHub. (#15178) @bdice
Implement segmented_row_bit_count for computing row sizes by segments of rows (#15169) @ttnghia
Implement search using pylibcudf (#15166) @vyasr
Add distinct left join (#15149) @PointKernel
Add cardinality control for groupby benchs with flat types (#15134) @PointKernel
Add ability to request Parquet encodings on a per-column basis (#15081) @etseidl
Automate include grouping order in .clang-format (#15063) @harrism
Requesting a clean build directory also clears Jitify cache (#15052) @robertmaynard
API for JSON unquoted whitespace normalization (#15033) @shrshi
Implement concatenate, lists.explode, merge, sorting, and stream compaction in pylibcudf (#15011) @vyasr
Implement replace in pylibcudf (#15005) @vyasr
Add distinct key inner join (#14990) @PointKernel
Implement rolling in pylibcudf (#14982) @vyasr
Implement joins in pylibcudf (#14972) @vyasr
Implement scans and reductions in pylibcudf (#14970) @vyasr
Rewrite cudf internals using pylibcudf groupby (#14946) @vyasr
Implement groupby in pylibcudf (#14945) @vyasr
Support casting of Map type to string in JSON reader (#14936) @karthikeyann
POC for whitespace removal in input JSON data using FST (#14931) @shrshi
Support for LZ4 compression in ORC and Parquet (#14906) @vuule
Remove supports_streams from cuDF custom memory resources. (#14857) @harrism
Migrate unary operations to pylibcudf (#14850) @vyasr
Migrate binary operations to pylibcudf (#14821) @vyasr
Add row index and stripe size options to Python ORC chunked writer (#14785) @vuule
Support CUDA 12.2 (#14712) @jameslamb

🛠️ Improvements

Use conda env create --yes instead of --force (#15403) @bdice
Restructure pylibcudf/arrow interop facilities (#15325) @vyasr
Change exceptions thrown by copying APIs (#15319) @vyasr
Enable branch testing for cudf.pandas (#15316) @galipremsagar
Replace black with ruff-format (#15312) @mroeschke
This fixes an NPE when trying to read empty JSON data by adding a new API for missing information (#15307) @revans2
Address poor performance of Parquet string decoding (#15304) @etseidl
Update script input name (#15301) @AyodeAwe
Make test_read_parquet_partitioned_filtered data deterministic (#15296) @mroeschke
Add timeout for cudf.pandas pandas tests (#15284) @galipremsagar
Add upper bound to prevent usage of NumPy 2 (#15283) @bdice
Fix cudf::test::to_host return of host_vector (#15263) @davidwendt
Implement grouped product scan (#15254) @wence-
Add CUDA 12.4 to supported PTX versions (#15247) @brandon-b-miller
Implement DataFrame|Series.squeeze (#15244) @mroeschke
Roll back ipow changes due to register pressure. (#15242) @pmattione-nvidia
Remove create_chars_child_column utility (#15241) @davidwendt
Update dlpack to version 0.8 (#15237) @dantegd
Improve performance in JSON reader when mixed_types_as_string option is enabled (#15236) @shrshi
Remove row conversion code from libcudf (#15234) @ttnghia
Use variable substitution for RAPIDS version in Doxyfile (#15231) @KyleFromNVIDIA
Add ListColumns.to_pandas(arrow_type=) (#15228) @mroeschke
Treat dask-cudf CI artifacts as pure wheels (#15223) @bdice
Clean up usage of __CUDA_ARCH__ and other macros. (#15218) @bdice
DOC: use constants in performance-comparisons.ipynb (#15215) @raybellwaves
Rewrite conversion in terms of column (#15213) @vyasr
Switch pytest-xdist algo to worksteal (#15207) @galipremsagar
Deprecate strings_column_view::offsets_begin() (#15205) @davidwendt
Add get_upstream_resource method to stream_checking_resource_adaptor (#15203) @miscco
Tune up row size estimation in the data generator (#15202) @vuule
Fix offset value for generating test data in parquet_chunked_reader_test.cu (#15200) @ttnghia
Change strings_column_view::char_size to return int64 (#15197) @davidwendt
Fix includes for row_operators.cuh (#15194) @davidwendt
Generalize GHA selectors for pure Python testing (#15191) @bdice
Improvements for __cuda_array_interface__ tests (#15188) @bdice
Allow to_pandas to return pandas.ArrowDtype (#15182) @mroeschke
Ignore byte_range in read_json when the size is not smaller than the input data (#15180) @vuule
Expose new stable_sort and finish stream_compaction in pylibcudf (#15175) @wence-
[ci] update matrix filters for dask-cudf builds (#15174) @jameslamb
Change make_strings_children to return uvector (#15171) @davidwendt
Don't override to_pandas for Datelike columns (#15167) @mroeschke
Drop python-snappy from dependencies. (#15161) @bdice
Add microkernels for fixed-width and fixed-width dictionary in Parquet decode (#15159) @abellina
Make HostColumnVector.DataType accessor methods public (#15157) @jbrennan333
Java bindings for left outer distinct join (#15154) @jlowe
Forward-merge branch-24.02 to branch-24.04 (#15153) @bdice
Enable pandas pytests for cudf.pandas (#15147) @galipremsagar
Add java option to keep quotes for JSON reads (#15146) @revans2
Change cross-pandas-version testing in cudf (#15145) @galipremsagar
Use hostdevice_vector in kernel_error to avoid the pageable copy (#15140) @vuule
Clean up Columns.astype & cudf.dtype (#15125) @mroeschke
Simplify some to_pandas implementations (#15123) @mroeschke
Java: Add leak tracking for Scalar instances (#15121) @jlowe
Remove calls to strings_column_view::offsets_begin() (#15112) @davidwendt
Add support for Python 3.11, require NumPy 1.23+ (#15111) @jameslamb
Compile-time ipow computation with array lookup (#15110) @pmattione-nvidia
Upgrade to arrow-14.0.2 (#15108) @galipremsagar
Dynamically set version in RAPIDS doc builds (#15101) @jakirkham
Add support for pandas-2.2 in cudf (#15100) @galipremsagar
Update devcontainers to CUDA Toolkit 12.2 (#15099) @trxcllnt
Fix datetime binop pytest failures in pandas-2.2 (#15090) @galipremsagar
Validate types in pylibcudf Column/Table constructors (#15088) @wence-
xfail test_join_ordering_pandas_compat for pandas 2.2 (#15080) @mroeschke
Add general purpose host memory allocator reference to cuIO with a demo of pooled-pinned allocation. (#15079) @nvdbaranec
Adjust test_binops for pandas 2.2 (#15078) @mroeschke
Remove offsets_begin() call from nvtext::generate_ngrams (#15077) @davidwendt
Use offsetalator in cudf::detail::has_nonempty_null_rows (#15076) @davidwendt
Deprecate cudf::hashing::spark_murmurhash3_x86_32 (#15074) @davidwendt
Fix cudf::test::to_host to handle both offset types for strings columns (#15073) @davidwendt
Add condition for test_groupby_nulls_basic in pandas 2.2 (#15072) @mroeschke
xfail tests in test_udf_masked_ops due to pandas 2.2 bug (#15071) @mroeschke
target branch-24.04 for GitHub Actions workflows (#15069) @jameslamb
Implement stable version of cudf::sort (#15066) @wence-
Fix ORC and JSON tests failures for pandas 2.2 (#15062) @mroeschke
Adjust test_joining for pandas 2.2 (#15060) @mroeschke
Align MultiIndex.get_indexer with pandas 2.2 change (#15059) @mroeschke
Fix test_resample index dtype checking for pandas 2.2 (#15058) @mroeschke
Split out strings/replace.cu and rework its gtests (#15054) @davidwendt
Avoid incompatible value type setting in test_rolling for pandas 2.2 (#15050) @mroeschke
Change chained replace inplace test to COW test for pandas 2.2 (#15049) @mroeschke
Deprecate datelike isin casting strings to dates to match pandas 2.2 (#15046) @mroeschke
Avoid chained indexing in test_indexing for pandas 2.2 (#15045) @mroeschke
Avoid pandas 2.2 DeprecationWarning in test_hdf (#15044) @mroeschke
Use appropriate make_offsets_child_column for building lists columns (#15043) @davidwendt
Factor out position-offsets logic from strings split_helper utility (#15040) @davidwendt
Forward-merge branch-24.02 to branch-24.04 (#15039) @bdice
Clean up nvtx macros (#15038) @PointKernel
Add xfailures for test_applymap for pandas 2.2 (#15034) @mroeschke
Expose libcudf filter expression in read_parquet (#15028) @wence-
Adjust tests in test_dataframe.py for pandas 2.2 (#15023) @mroeschke
Adjust test_datetime_infer_format for pandas 2.2 (#15021) @mroeschke
Performance optimizations for parquet sub-rowgroup reader. (#15020) @nvdbaranec
JNI bindings for distinct_hash_join (#15019) @jlowe
Change copy_if_safe to call thrust instead of the overload function (#15018) @davidwendt
Improve performance of copy_if_else for long strings (#15017) @davidwendt
Fix is_string_dtype test for pandas 2.2 (#15012) @mroeschke
Rework cudf::strings::detail::copy_range for offsetalator (#15010) @davidwendt
Use offsetalator in cudf::get_json_object() (#15009) @davidwendt
Align integral types in ORC to specs (#15008) @vuule
Clean up detail sequence header inclusion (#15007) @PointKernel
Add groupby.apply(include_groups=) to match pandas 2.2 deprecation (#15006) @mroeschke
Use offsetalator in cudf::interleave_columns() (#15004) @davidwendt
Use offsetalator in cudf::row_bit_count() (#15003) @davidwendt
Use offsetalator in cudf::strings::wrap() (#15002) @davidwendt
Use offsetalator in cudf::strings::reverse (#15001) @davidwendt
Deprecate groupby fillna (#15000) @mroeschke
Ensure to_* IO methods respect pandas 2.2 keyword only deprecation (#14999) @mroeschke
Remove unneeded calls to create_chars_child_column utility (#14997) @davidwendt
Add environment-agnostic scripts for running ctests and pytests (#14992) @trxcllnt
Filter all DeprecationWarning's by ArrowTable.to_pandas() (#14989) @galipremsagar
Deprecate replace with categorical columns (#14988) @mroeschke
Deprecate delim_whitespace in read_csv for pandas 2.2 (#14986) @mroeschke
Deprecate parameters similar to pandas 2.2 (#14984) @mroeschke
Ensure that ctest is called with --no-tests=error. (#14983) @bdice
Deprecate non-integer periods in date_range and interval_range (#14976) @galipremsagar
Update ops-bot.yaml (#14974) @AyodeAwe
Use page statistics in Parquet reader (#14973) @etseidl
Use fused types for overloaded function signatures (#14969) @vyasr
Deprecate certain frequency strings (#14967) @galipremsagar
Update copyrights for 24.04. (#14964) @bdice
Add missing atomic operators, refactor atomic operators, move atomic operators to detail namespace. (#14962) @bdice
Introduce GetJsonObjectOptions in getJSONObject Java API (#14956) @SurajAralihalli
JNI JSON read with DataSource and inferred schema, along with basic java nested Schema JSON reads (#14954) @revans2
Make codecov only informational (always pass). (#14952) @bdice
Replace legacy cudf and dask_cudf imports as (d)gd (#14944) @mroeschke
Replace _is_datetime64tz/interval_dtype with isinstance (#14943) @mroeschke
Update tests for pandas 2. (#14941) @bdice
Use more public pandas APIs (#14929) @mroeschke
Replace local copyright check with pre-commit-hooks verify-copyright (#14917) @KyleFromNVIDIA
Add pandas-2.x support in cudf (#14916) @galipremsagar
Use offsetalator in nvtext::bytepairencoding (#14888) @davidwendt
De-DOS line-endings (#14880) @wence-
Add detail cuco_allocator (#14877) @PointKernel
Move all core types to using enum class in Cython (#14876) @vyasr
Read cudf.__version__ in Sphinx build (#14872) @KyleFromNVIDIA
Use int64 offset types for accessing code-points in nvtext::normalize (#14868) @davidwendt
Read version from VERSION file in CMake (#14867) @KyleFromNVIDIA
Update conda-cpp-post-build-checks to branch-24.04. (#14854) @bdice
Update cudf for compatibility with the latest cuco (#14849) @PointKernel
Remove deprecated strings functions (#14848) @davidwendt
Fix CI workflows for pandas-tests and add test summary. (#14847) @bdice
Use offsetalator in cudf::strings::copy_slice (#14844) @davidwendt
Fix V2 Parquet page alignment for use with zStandard compression (#14841) @etseidl
Fix calls to deprecated strings factory API in examples. (#14838) @bdice
Update pre-commit hooks (#14837) @bdice
Use rapids_cuda_set_runtime to determine cuda runtime usage by target (#14833) @vyasr
Remove get_mem_info functions from custom memory resources (#14832) @harrism
Fix debug build by splitting row_operator_tests_utilities.cu (#14826) @davidwendt
Remove -DNVBench_ENABLE_CUPTI=OFF. (#14820) @bdice
Use cuco::static_set in the hash-based groupby (#14813) @PointKernel
Branch 24.04 merge branch 24.02 (#14809) @vyasr
Branch 24.04 merge branch 24.02 (#14806) @vyasr
Introduce basic "cudf" backend for Dask Expressions (#14805) @rjzamora
Remove build_struct|list_column (#14786) @mroeschke
Use offsetalator in nvtext tokenize functions (#14783) @davidwendt
Reduce execution time of Python ORC tests (#14776) @vuule
Use offsetalator in cudf::strings::split functions (#14757) @davidwendt
Use offsetalator in cudf::strings::findall (#14745) @davidwendt
Use offsetalator in cudf::strings::url_decode (#14744) @davidwendt
Use get_offset_value utility in strings shift function (#14743) @davidwendt
Use as_column instead of full (#14698) @mroeschke
List all notable breaking changes (#13535) @galipremsagar

- C++
Published by raydouglass about 2 years ago

https://github.com/rapidsai/cudf - [NIGHTLY] v24.06.00

🚨 Breaking Changes

Remove deprecated strings offsets_begin (#15454) @davidwendt
Floating <--> fixed-point conversion must now be called explicitly (#15438) @pmattione-nvidia
Remove deprecated hash() and spark_murmurhash3_x86_32() (#15375) @davidwendt
Remove empty elements from exploded character-ngrams output (#15371) @davidwendt
[FEA] Performance improvement for mixed left semi/anti join (#15288) @tgujar
Align date_range defaults with pandas, support tz (#15139) @mroeschke

🐛 Bug Fixes

nanoarrow uses package override for proper pinned versions generation (#15515) @robertmaynard
Make improvements in pandas-test reporting (#15485) @galipremsagar
Fixed page data truncation in parquet writer under certain conditions. (#15474) @nvdbaranec
Only use data_type constructor with scale for decimal types (#15472) @wence-
Avoid "p2p" shuffle as a default when dask_cudf is imported (#15469) @rjzamora
Fix debug build errors from to_arrow_device_test.cpp (#15463) @davidwendt
Fix base_normalator::integer_sizeof_fn integer dispatch (#15457) @davidwendt
Allow consumers of static builds to find nanoarrow (#15456) @robertmaynard
Allow jit compilation when using a splayed CUDA toolkit (#15451) @robertmaynard
Test static builds in CI and fix nanoarrow configure (#15437) @vyasr
Fixes potential race in JSON parser when parsing JSON lines format and when recovering from invalid lines (#15419) @elstehle
Fix errors in chunked ORC writer when no tables were (successfully) written (#15393) @vuule
Support implicit array conversion with query-planning enabled (#15378) @rjzamora
Fix arrow-based round trip of empty dataframes (#15373) @wence-
Remove empty elements from exploded character-ngrams output (#15371) @davidwendt
Remove boundscheck=False setting in cython files (#15362) @wence-
Patch dask-expr var logic in dask-cudf (#15347) @rjzamora
Fix for logical and syntactical errors in libcudf c++ examples (#15346) @mhaseeb123
Disable dask-expr in docs builds. (#15343) @bdice
Apply the cuFile error work around to data_sink as well (#15335) @vuule

📖 Documentation

Add debug tips section to libcudf developer guide (#15329) @davidwendt

🚀 New Features

Introduce benchmark suite for JSON reader options (#15124) @shrshi
Add to_arrow_device function to cudf interop using nanoarrow (#15047) @zeroshade

🛠️ Improvements

Enable tests/scalar and test/series in cudf.pandas tests (#15486) @mroeschke
Avoid .ordered and .categories from being settable in CategoricalColumn and CategoricalDtype (#15475) @mroeschke
Ignore pandas tests for cudf.pandas that need motoserver (#15468) @mroeschke
Use cached_property for NumericColumn.nan_count instead of ._nan_count variable (#15466) @mroeschke
Add custom status check workflow (#15464) @galipremsagar
Enable tests/strings/test_api.py and tests/io/pytables in cudf.pandas tests (#15461) @mroeschke
Remove deprecated strings offsets_begin (#15454) @davidwendt
Enable tests/windows/ in cudf.pandas tests (#15444) @mroeschke
Enable tests/interchange/test_impl.py in cudf.pandas tests (#15443) @mroeschke
Enable tests/io/test_user_agent.py in cudf pandas tests (#15442) @mroeschke
Remove prior test skipping in run-pandas-tests with testing 2.2.1 (#15440) @mroeschke
Support orc and text IO with dask-expr using legacy conversion (#15439) @rjzamora
Floating <--> fixed-point conversion must now be called explicitly (#15438) @pmattione-nvidia
Enable dask_cudf json and s3 tests with query-planning on (#15408) @rjzamora
Bump ruff and codespell pre-commit checks (#15407) @mroeschke
Enable all tests for arm arch (#15402) @galipremsagar
Remove deprecated hash() and spark_murmurhash3_x86_32() (#15375) @davidwendt
Enable test-reporting for pandas pytests in CI (#15369) @galipremsagar
Use logical types in Parquet reader (#15365) @etseidl
Forward-merge branch-24.04 to branch-24.06 (#15349) @bdice
Use ruff pydocstyle over pydocstyle pre-commit hook (#15345) @mroeschke
Refactor stream mode setup for gtests (#15337) @davidwendt
Avoid duplicate dask-cudf testing (#15333) @rjzamora
Update udf_cpp to use rapids_cpm_cccl. (#15331) @bdice
Forward-merge branch-24.04 into branch-24.06 skip ci @rapids-bot[bot]
Allow numeric_only=True for simple groupby reductions (#15326) @rjzamora
Drop CentOS 7 support. (#15323) @bdice
Rework cudf::find_and_replace_all to use gather-based make_strings_column (#15305) @davidwendt
First pass at adding testing for pylibcudf (#15300) @vyasr
[FEA] Performance improvement for mixed left semi/anti join (#15288) @tgujar
Rework cudf::replace_nulls to use strings::detail::copy_if_else (#15286) @davidwendt
Large strings support in cudf::concatenate (#15195) @davidwendt
Use less _is_categorical_dtype (#15148) @mroeschke
Align date_range defaults with pandas, support tz (#15139) @mroeschke
ModuleAccelerator performance: cache the result of checking if a caller is in the denylist (#15056) @shwina
Use offsetalator in cudf::strings::replace functions (#14824) @davidwendt
Cleanup some timedelta/datetime column logic (#14715) @mroeschke
Refactor numpy array input in as_column (#14651) @mroeschke

- C++
Published by rapids-bot[bot] about 2 years ago

https://github.com/rapidsai/cudf - v24.02.02

🚨 Breaking Changes

Remove **kwargs from astype (#14765) @mroeschke
Remove mimesis as a testing dependency (#14723) @mroeschke
Update to Dask's shuffle_method kwarg (#14708) @pentschev
Drop Pascal GPU support. (#14630) @bdice
Update to CCCL 2.2.0. (#14576) @bdice
Expunge as_frame conversions in Column algorithms (#14491) @wence-
Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
Include writer code and writerVersion in ORC files (#14458) @vuule
Remove null mask for zero nulls in json readers (#14451) @karthikeyann
REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
Move chars column to parent data buffer in strings column (#14202) @karthikeyann
Switch to scikit-build-core (#13531) @vyasr

🐛 Bug Fixes

Bump to nvcomp 3.0.6. (#15128) @bdice
[HOTFIX] Unpin numba<0.58 (#15031) @raydouglass
Exclude tests from builds (#14981) @vyasr
Fix the bounce buffer size in ORC writer (#14947) @vuule
Revert sum/product aggregation to always produce int64_t type (#14907) @SurajAralihalli
Fixed an issue with output chunking computation stemming from input chunking. (#14889) @nvdbaranec
Fix total_byte_size in Parquet row group metadata (#14802) @etseidl
Fix index difference to follow the pandas format (#14789) @amiralimi
Fix shared-workflows repo name (#14784) @raydouglass
Remove unparseable attributes from all nodes (#14780) @vyasr
Refactor and add validation to IntervalIndex.__init__ (#14778) @mroeschke
Work around incompatibilities between V2 page header handling and zStandard compression in Parquet writer (#14772) @etseidl
Fix calls to deprecated strings factory API (#14771) @davidwendt
Fix ptx file discovery in editable installs (#14767) @vyasr
Revise shuffle deprecation to align with dask/dask (#14762) @rjzamora
Enable intermediate proxies to be picklable (#14752) @shwina
Add CUDF_TEST_PROGRAM_MAIN macro to tests lacking it (#14751) @etseidl
Fix CMake args (#14746) @vyasr
Fix logic bug introduced in #14730 (#14742) @wence-
[Java] Choose The Correct RoundingMode For Checking Decimal OutOfBounds (#14731) @razajafri
Fix Groupby.get_group (#14728) @rjzamora
Ensure that all CUDA kernels in cudf have hidden visibility. (#14726) @robertmaynard
Split cuda versions for notebook testing (#14722) @raydouglass
Fix to_numeric not preserving Series index and name (#14718) @mroeschke
Update dask-cudf wheel name (#14713) @raydouglass
Fix strings::contains matching end of string target (#14711) @davidwendt
Update to Dask's shuffle_method kwarg (#14708) @pentschev
Write file-level statistics when writing ORC files with zero rows (#14707) @vuule
Potential fix for performance regression in #14415 (#14706) @etseidl
Ensure DataFrame column types are preserved during serialization (#14705) @mroeschke
Skip numba test that fails on ARM (#14702) @brandon-b-miller
Allow Z in datetime string parsing in non pandas compat mode (#14701) @mroeschke
Fix nan_as_null not being respected when passing arrow object (#14688) @mroeschke
Fix constructing Series/Index from arrow array and dtype (#14686) @mroeschke
Fix Aggregation Type Promotion: Ensure Unsigned Input Types Result in Unsigned Output for Sum and Multiply (#14679) @SurajAralihalli
Add BaseOffset as a final proxy type to pass instancechecks for offsets against BaseOffset (#14678) @shwina
Add row conversion code from spark-rapids-jni (#14664) @ttnghia
Unconditionally export the CCCL path (#14656) @vyasr
Ensure libcudf searches for our patched version of CCCL first (#14655) @robertmaynard
Constrain CUDA in notebook testing to prevent CUDA 12.1 usage until we have pynvjitlink (#14648) @vyasr
Fix invalid memory access in Parquet reader (#14637) @etseidl
Use column_empty over as_column([]) (#14632) @mroeschke
Add (implicit) handling for torch tensors in is_scalar (#14623) @wence-
Fix astype/fillna not maintaining column subclass and types (#14615) @mroeschke
Remove non-empty nulls in cudf::get_json_object (#14609) @davidwendt
Remove cuda::proclaim_return_type from nested lambda (#14607) @ttnghia
Fix DataFrame.reindex when column reindexing to MultiIndex/RangeIndex (#14605) @mroeschke
Address potential race conditions in Parquet reader (#14602) @etseidl
Fix DataFrame.reindex removing column name (#14601) @mroeschke
Remove unsanitized input test data from copy gtests (#14600) @davidwendt
Fix race detected in Parquet writer (#14598) @etseidl
Correct invalid or missing return types (#14587) @robertmaynard
Fix unsanitized nulls from strings segmented-reduce (#14586) @davidwendt
Upgrade to nvCOMP 3.0.5 (#14581) @davidwendt
Fix unsanitized nulls produced by cudf::clamp APIs (#14580) @davidwendt
Fix unsanitized nulls produced by libcudf dictionary decode (#14578) @davidwendt
Fixes a symbol group lookup table issue (#14561) @elstehle
Drop llvm16 from cuda118-conda devcontainer image (#14526) @charlesbluca
REF: Make DataFrame.from_pandas process by column (#14483) @mroeschke
Improve memory footprint of isin by using contains (#14478) @wence-
Move creation of env.yaml outside the current directory (#14476) @davidwendt
Enable pd.Timestamp objects to be picklable when cudf.pandas is active (#14474) @shwina
Correct dtype of count aggregations on empty dataframes (#14473) @wence-
Avoid DataFrame conversion in MultiIndex.from_pandas (#14470) @mroeschke
JSON writer: avoid default stream use in string_scalar constructors (#14444) @vuule
Fix default stream use in the CSV reader (#14443) @vuule
Preserve DataFrame(columns=).columns dtype during empty-like construction (#14381) @mroeschke
Defer PTX file load to runtime (#13690) @brandon-b-miller

📖 Documentation

Disable parallel build (#14796) @vyasr
Add pylibcudf to the docs (#14791) @vyasr
Describe unpickling expectations when cudf.pandas is enabled (#14693) @shwina
Update CONTRIBUTING for pyproject-only builds (#14653) @vyasr
More doxygen fixes (#14639) @vyasr
Enable doxygen XML generation and fix issues (#14477) @vyasr
Some doxygen improvements (#14469) @vyasr
Remove warning in dask-cudf docs (#14454) @wence-
Update README links with redirects. (#14378) @bdice
Add pip install instructions to README (#13677) @shwina

🚀 New Features

Add ci check for external kernels (#14768) @robertmaynard
JSON single quote normalization API (#14729) @shrshi
Write cuDF version in Parquet "created_by" metadata field (#14721) @etseidl
Implement remaining copying APIs in pylibcudf along with required helper functions (#14640) @vyasr
Don't constrain numba<0.58 (#14616) @brandon-b-miller
Add DELTA_LENGTH_BYTE_ARRAY encoder and decoder for Parquet (#14590) @etseidl
JSON - Parse mixed types as string in JSON reader (#14572) @karthikeyann
JSON quote normalization (#14545) @shrshi
Make DefaultHostMemoryAllocator settable (#14523) @gerashegalov
Implement more copying APIs in pylibcudf (#14508) @vyasr
Include writer code and writerVersion in ORC files (#14458) @vuule
Parquet sub-rowgroup reading. (#14360) @nvdbaranec
Move chars column to parent data buffer in strings column (#14202) @karthikeyann
PARQUET-2261 Size Statistics (#14000) @etseidl
Improve GroupBy JIT error handling (#13854) @brandon-b-miller
Generate unified Python/C++ docs (#13846) @vyasr
Expand JIT groupby test suite (#13813) @brandon-b-miller

🛠️ Improvements

Pin pytest<8 (#14920) @galipremsagar
Move cudf::char_utf8 definition from detail to public header (#14779) @davidwendt
Clean up TimedeltaIndex.__init__ constructor (#14775) @mroeschke
Clean up DatetimeIndex.__init__ constructor (#14774) @mroeschke
Some frame.py typing, move seldom used methods in frame.py (#14766) @mroeschke
Remove **kwargs from astype (#14765) @mroeschke
fix benchmarks compatibility with newer pytest-cases (#14764) @jameslamb
Add pynvjitlink as a dependency (#14763) @brandon-b-miller
Resolve degenerate performance in create_structs_data (#14761) @SurajAralihalli
Simplify ColumnAccessor methods; avoid unnecessary validations (#14758) @mroeschke
Pin pytest-cases<3.8.2 (#14756) @mroeschke
Use _from_data instead of _from_columns for initializing Frame (#14755) @mroeschke
Consolidate cudf object handling in as_column (#14754) @mroeschke
Reduce execution time of Parquet C++ tests (#14750) @vuule
Implement to_datetime(..., utc=True) (#14749) @mroeschke
Remove usages of rapids-env-update (#14748) @KyleFromNVIDIA
Provide explicit pool size and avoid RMM detail APIs (#14741) @harrism
Implement cudf.MultiIndex.from_arrays (#14740) @mroeschke
Remove unused/single use methods (#14739) @mroeschke
refactor CUDA versions in dependencies.yaml (#14733) @jameslamb
Remove unneeded methods in Column (#14730) @mroeschke
Clean up base column methods (#14725) @mroeschke
Ensure column.fillna signatures are consistent (#14724) @mroeschke
Remove mimesis as a testing dependency (#14723) @mroeschke
Replace as_numerical with as_numerical_column/codes (#14719) @mroeschke
Use offsetalator in gather_chars (#14700) @davidwendt
Use make_strings_children for fill() specialization logic (#14697) @davidwendt
Change io::detail::orc namespace into io::orc::detail (#14696) @ttnghia
Fix call to deprecated factory function (#14695) @davidwendt
Use as_column instead of arange for range like inputs (#14689) @mroeschke
Reorganize ORC reader into multiple files and perform some small fixes to cuIO code (#14665) @ttnghia
Split parquet test into multiple files (#14663) @etseidl
Custom error messages for IO with nonexistent files (#14662) @vuule
Explicitly pass .dtype into is_foo_dtype functions (#14657) @mroeschke
Basic validation in reader benchmarks (#14647) @vuule
Update dependencies.yaml to support CUDA 12.*. (#14644) @bdice
Consolidate memoryview handling in as_column (#14643) @mroeschke
Convert FieldType to scoped enum (#14642) @vuule
Use isinstance over is_foo_dtype (#14641) @mroeschke
Use isinstance over is_foo_dtype internally (#14638) @mroeschke
Remove unnecessary **kwargs in function signatures (#14635) @mroeschke
Drop nvbench patch for nvml. (#14631) @bdice
Drop Pascal GPU support. (#14630) @bdice
Add cpp/doxygen/xml to .gitignore (#14613) @davidwendt
Create strings-specific make_offsets_child_column for multiple offset types (#14612) @davidwendt
Use the offsetalator in cudf::concatenate for strings (#14611) @davidwendt
Make Parquet ColumnIndex null_counts optional (#14596) @etseidl
Support freq in DatetimeIndex (#14593) @shwina
Remove legacy benchmarks for cuDF-python (#14591) @osidekyle
Remove WORKSPACE env var from cudf_test temp_directory class (#14588) @davidwendt
Use exceptions instead of return values to handle errors in CompactProtocolReader (#14582) @vuule
Use cuda::proclaim_return_type on device lambdas. (#14577) @bdice
Update to CCCL 2.2.0. (#14576) @bdice
Update dependencies.yaml to new pip index (#14575) @vyasr
Simplify Python CMake (#14565) @vyasr
Java expose parquet pass_read_limit (#14564) @revans2
Add column sanitization checks in CUDF_TEST_EXPECT_COLUMN_* macros (#14559) @SurajAralihalli
Use cudf_test temp_directory class for nvtext::subword_tokenize gbenchmark (#14558) @davidwendt
Fix return type of prefix increment overloads (#14544) @vuule
Make bpe_merge_pairs_impl member private (#14543) @davidwendt
Small clean up in io::statistics (#14542) @vuule
Change json gtest environment variable to compile-time definition (#14541) @davidwendt
Remove extra total chars size calculation from cudf::concatenate (#14540) @davidwendt
Refactor IndexedFrame.hash_values to use cudf::hashing functions, add xxhash64 to cudf Python. (#14538) @bdice
Move non-templated inline function definitions from table_view.hpp to table_view.cpp (#14535) @davidwendt
Add JNI for strings::code_points (#14533) @thirtiseven
Add a test for issue 12773 (#14529) @vyasr
Split libarrow build dependencies. (#14506) @bdice
Implement IndexedFrame.duplicated with distinct_indices + scatter (#14493) @wence-
Expunge as_frame conversions in Column algorithms (#14491) @wence-
Remove unsanitized null from input strings column in rank_tests.cpp (#14475) @davidwendt
Refactor Parquet kernel_error (#14464) @etseidl
Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
Introduce Comprehensive Pathological Unit Tests for Issue #14409 (#14459) @aocsa
Expose stream parameter in public nvtext APIs (#14456) @davidwendt
Include encode type in the error message when unsupported Parquet encoding is detected (#14453) @ZelboK
Remove null mask for zero nulls in json readers (#14451) @karthikeyann
Refactor cudf.Series.__init__ (#14450) @mroeschke
Remove the use of volatile in Parquet (#14448) @vuule
REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
Testing stream pool implementation (#14437) @shrshi
Match pandas join ordering obligations in pandas-compatible mode (#14428) @wence-
Forward-merge branch-23.12 to branch-24.02 (#14426) @bdice
Use isinstance(..., cudf.IntervalDtype) instead of is_interval_dtype (#14424) @mroeschke
Use isinstance(..., cudf.CategoricalDtype) instead of is_categorical_dtype (#14423) @mroeschke
Forward-merge branch-23.12 to branch-24.02 (#14422) @bdice
REF: Remove instances of pd.core (#14421) @mroeschke
Expose streams in public filling APIs for label_bins (#14401) @ZelboK
Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
Limit DELTA_BINARY_PACKED encoder to the same number of bits as the physical type being encoded (#14392) @etseidl
Add SHA-1 and SHA-2 hash functions. (#14391) @bdice
Expose streams in Parquet reader and writer APIs (#14359) @shrshi
Update to fmt 10.1.1 and spdlog 1.12.0. (#14355) @bdice
Replace default stream for scalars and column factories usages (because of defaulted arguments) (#14354) @karthikeyann
Expose streams in ORC reader and writer APIs (#14350) @shrshi
Convert compression and io to string axis type in IO benchmarks (#14347) @SurajAralihalli
Add cuDF devcontainers (#14015) @trxcllnt
Refactoring of Buffers (last step towards unifying COW and Spilling) (#13801) @madsbk
Switch to scikit-build-core (#13531) @vyasr
Simplify null count checking in column equality comparator (#13312) @vyasr

- C++
Published by raydouglass over 2 years ago

https://github.com/rapidsai/cudf - v24.02.01

🚨 Breaking Changes

Remove **kwargs from astype (#14765) @mroeschke
Remove mimesis as a testing dependency (#14723) @mroeschke
Update to Dask's shuffle_method kwarg (#14708) @pentschev
Drop Pascal GPU support. (#14630) @bdice
Update to CCCL 2.2.0. (#14576) @bdice
Expunge as_frame conversions in Column algorithms (#14491) @wence-
Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
Include writer code and writerVersion in ORC files (#14458) @vuule
Remove null mask for zero nulls in json readers (#14451) @karthikeyann
REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
Move chars column to parent data buffer in strings column (#14202) @karthikeyann
Switch to scikit-build-core (#13531) @vyasr

🐛 Bug Fixes

[HOTFIX] Unpin numba<0.58 (#15031) @raydouglass
Exclude tests from builds (#14981) @vyasr
Fix the bounce buffer size in ORC writer (#14947) @vuule
Revert sum/product aggregation to always produce int64_t type (#14907) @SurajAralihalli
Fixed an issue with output chunking computation stemming from input chunking. (#14889) @nvdbaranec
Fix total_byte_size in Parquet row group metadata (#14802) @etseidl
Fix index difference to follow the pandas format (#14789) @amiralimi
Fix shared-workflows repo name (#14784) @raydouglass
Remove unparseable attributes from all nodes (#14780) @vyasr
Refactor and add validation to IntervalIndex.__init__ (#14778) @mroeschke
Work around incompatibilities between V2 page header handling and zStandard compression in Parquet writer (#14772) @etseidl
Fix calls to deprecated strings factory API (#14771) @davidwendt
Fix ptx file discovery in editable installs (#14767) @vyasr
Revise shuffle deprecation to align with dask/dask (#14762) @rjzamora
Enable intermediate proxies to be picklable (#14752) @shwina
Add CUDF_TEST_PROGRAM_MAIN macro to tests lacking it (#14751) @etseidl
Fix CMake args (#14746) @vyasr
Fix logic bug introduced in #14730 (#14742) @wence-
[Java] Choose The Correct RoundingMode For Checking Decimal OutOfBounds (#14731) @razajafri
Fix Groupby.get_group (#14728) @rjzamora
Ensure that all CUDA kernels in cudf have hidden visibility. (#14726) @robertmaynard
Split cuda versions for notebook testing (#14722) @raydouglass
Fix to_numeric not preserving Series index and name (#14718) @mroeschke
Update dask-cudf wheel name (#14713) @raydouglass
Fix strings::contains matching end of string target (#14711) @davidwendt
Update to Dask's shuffle_method kwarg (#14708) @pentschev
Write file-level statistics when writing ORC files with zero rows (#14707) @vuule
Potential fix for performance regression in #14415 (#14706) @etseidl
Ensure DataFrame column types are preserved during serialization (#14705) @mroeschke
Skip numba test that fails on ARM (#14702) @brandon-b-miller
Allow Z in datetime string parsing in non pandas compat mode (#14701) @mroeschke
Fix nan_as_null not being respected when passing arrow object (#14688) @mroeschke
Fix constructing Series/Index from arrow array and dtype (#14686) @mroeschke
Fix Aggregation Type Promotion: Ensure Unsigned Input Types Result in Unsigned Output for Sum and Multiply (#14679) @SurajAralihalli
Add BaseOffset as a final proxy type to pass instancechecks for offsets against BaseOffset (#14678) @shwina
Add row conversion code from spark-rapids-jni (#14664) @ttnghia
Unconditionally export the CCCL path (#14656) @vyasr
Ensure libcudf searches for our patched version of CCCL first (#14655) @robertmaynard
Constrain CUDA in notebook testing to prevent CUDA 12.1 usage until we have pynvjitlink (#14648) @vyasr
Fix invalid memory access in Parquet reader (#14637) @etseidl
Use column_empty over as_column([]) (#14632) @mroeschke
Add (implicit) handling for torch tensors in is_scalar (#14623) @wence-
Fix astype/fillna not maintaining column subclass and types (#14615) @mroeschke
Remove non-empty nulls in cudf::get_json_object (#14609) @davidwendt
Remove cuda::proclaim_return_type from nested lambda (#14607) @ttnghia
Fix DataFrame.reindex when column reindexing to MultiIndex/RangeIndex (#14605) @mroeschke
Address potential race conditions in Parquet reader (#14602) @etseidl
Fix DataFrame.reindex removing column name (#14601) @mroeschke
Remove unsanitized input test data from copy gtests (#14600) @davidwendt
Fix race detected in Parquet writer (#14598) @etseidl
Correct invalid or missing return types (#14587) @robertmaynard
Fix unsanitized nulls from strings segmented-reduce (#14586) @davidwendt
Upgrade to nvCOMP 3.0.5 (#14581) @davidwendt
Fix unsanitized nulls produced by cudf::clamp APIs (#14580) @davidwendt
Fix unsanitized nulls produced by libcudf dictionary decode (#14578) @davidwendt
Fixes a symbol group lookup table issue (#14561) @elstehle
Drop llvm16 from cuda118-conda devcontainer image (#14526) @charlesbluca
REF: Make DataFrame.from_pandas process by column (#14483) @mroeschke
Improve memory footprint of isin by using contains (#14478) @wence-
Move creation of env.yaml outside the current directory (#14476) @davidwendt
Enable pd.Timestamp objects to be picklable when cudf.pandas is active (#14474) @shwina
Correct dtype of count aggregations on empty dataframes (#14473) @wence-
Avoid DataFrame conversion in MultiIndex.from_pandas (#14470) @mroeschke
JSON writer: avoid default stream use in string_scalar constructors (#14444) @vuule
Fix default stream use in the CSV reader (#14443) @vuule
Preserve DataFrame(columns=).columns dtype during empty-like construction (#14381) @mroeschke
Defer PTX file load to runtime (#13690) @brandon-b-miller

📖 Documentation

Disable parallel build (#14796) @vyasr
Add pylibcudf to the docs (#14791) @vyasr
Describe unpickling expectations when cudf.pandas is enabled (#14693) @shwina
Update CONTRIBUTING for pyproject-only builds (#14653) @vyasr
More doxygen fixes (#14639) @vyasr
Enable doxygen XML generation and fix issues (#14477) @vyasr
Some doxygen improvements (#14469) @vyasr
Remove warning in dask-cudf docs (#14454) @wence-
Update README links with redirects. (#14378) @bdice
Add pip install instructions to README (#13677) @shwina

🚀 New Features

Add ci check for external kernels (#14768) @robertmaynard
JSON single quote normalization API (#14729) @shrshi
Write cuDF version in Parquet "created_by" metadata field (#14721) @etseidl
Implement remaining copying APIs in pylibcudf along with required helper functions (#14640) @vyasr
Don't constrain numba<0.58 (#14616) @brandon-b-miller
Add DELTA_LENGTH_BYTE_ARRAY encoder and decoder for Parquet (#14590) @etseidl
JSON - Parse mixed types as string in JSON reader (#14572) @karthikeyann
JSON quote normalization (#14545) @shrshi
Make DefaultHostMemoryAllocator settable (#14523) @gerashegalov
Implement more copying APIs in pylibcudf (#14508) @vyasr
Include writer code and writerVersion in ORC files (#14458) @vuule
Parquet sub-rowgroup reading. (#14360) @nvdbaranec
Move chars column to parent data buffer in strings column (#14202) @karthikeyann
PARQUET-2261 Size Statistics (#14000) @etseidl
Improve GroupBy JIT error handling (#13854) @brandon-b-miller
Generate unified Python/C++ docs (#13846) @vyasr
Expand JIT groupby test suite (#13813) @brandon-b-miller

🛠️ Improvements

Pin pytest<8 (#14920) @galipremsagar
Move cudf::char_utf8 definition from detail to public header (#14779) @davidwendt
Clean up TimedeltaIndex.__init__ constructor (#14775) @mroeschke
Clean up DatetimeIndex.__init__ constructor (#14774) @mroeschke
Some frame.py typing, move seldom used methods in frame.py (#14766) @mroeschke
Remove **kwargs from astype (#14765) @mroeschke
fix benchmarks compatibility with newer pytest-cases (#14764) @jameslamb
Add pynvjitlink as a dependency (#14763) @brandon-b-miller
Resolve degenerate performance in create_structs_data (#14761) @SurajAralihalli
Simplify ColumnAccessor methods; avoid unnecessary validations (#14758) @mroeschke
Pin pytest-cases<3.8.2 (#14756) @mroeschke
Use _from_data instead of _from_columns for initializing Frame (#14755) @mroeschke
Consolidate cudf object handling in as_column (#14754) @mroeschke
Reduce execution time of Parquet C++ tests (#14750) @vuule
Implement to_datetime(..., utc=True) (#14749) @mroeschke
Remove usages of rapids-env-update (#14748) @KyleFromNVIDIA
Provide explicit pool size and avoid RMM detail APIs (#14741) @harrism
Implement cudf.MultiIndex.from_arrays (#14740) @mroeschke
Remove unused/single use methods (#14739) @mroeschke
refactor CUDA versions in dependencies.yaml (#14733) @jameslamb
Remove unneeded methods in Column (#14730) @mroeschke
Clean up base column methods (#14725) @mroeschke
Ensure column.fillna signatures are consistent (#14724) @mroeschke
Remove mimesis as a testing dependency (#14723) @mroeschke
Replace as_numerical with as_numerical_column/codes (#14719) @mroeschke
Use offsetalator in gather_chars (#14700) @davidwendt
Use make_strings_children for fill() specialization logic (#14697) @davidwendt
Change io::detail::orc namespace into io::orc::detail (#14696) @ttnghia
Fix call to deprecated factory function (#14695) @davidwendt
Use as_column instead of arange for range like inputs (#14689) @mroeschke
Reorganize ORC reader into multiple files and perform some small fixes to cuIO code (#14665) @ttnghia
Split parquet test into multiple files (#14663) @etseidl
Custom error messages for IO with nonexistent files (#14662) @vuule
Explicitly pass .dtype into is_foo_dtype functions (#14657) @mroeschke
Basic validation in reader benchmarks (#14647) @vuule
Update dependencies.yaml to support CUDA 12.*. (#14644) @bdice
Consolidate memoryview handling in as_column (#14643) @mroeschke
Convert FieldType to scoped enum (#14642) @vuule
Use isinstance over is_foo_dtype (#14641) @mroeschke
Use isinstance over is_foo_dtype internally (#14638) @mroeschke
Remove unnecessary **kwargs in function signatures (#14635) @mroeschke
Drop nvbench patch for nvml. (#14631) @bdice
Drop Pascal GPU support. (#14630) @bdice
Add cpp/doxygen/xml to .gitignore (#14613) @davidwendt
Create strings-specific make_offsets_child_column for multiple offset types (#14612) @davidwendt
Use the offsetalator in cudf::concatenate for strings (#14611) @davidwendt
Make Parquet ColumnIndex null_counts optional (#14596) @etseidl
Support freq in DatetimeIndex (#14593) @shwina
Remove legacy benchmarks for cuDF-python (#14591) @osidekyle
Remove WORKSPACE env var from cudf_test temp_directory class (#14588) @davidwendt
Use exceptions instead of return values to handle errors in CompactProtocolReader (#14582) @vuule
Use cuda::proclaim_return_type on device lambdas. (#14577) @bdice
Update to CCCL 2.2.0. (#14576) @bdice
Update dependencies.yaml to new pip index (#14575) @vyasr
Simplify Python CMake (#14565) @vyasr
Java expose parquet pass_read_limit (#14564) @revans2
Add column sanitization checks in CUDF_TEST_EXPECT_COLUMN_* macros (#14559) @SurajAralihalli
Use cudf_test temp_directory class for nvtext::subword_tokenize gbenchmark (#14558) @davidwendt
Fix return type of prefix increment overloads (#14544) @vuule
Make bpe_merge_pairs_impl member private (#14543) @davidwendt
Small clean up in io::statistics (#14542) @vuule
Change json gtest environment variable to compile-time definition (#14541) @davidwendt
Remove extra total chars size calculation from cudf::concatenate (#14540) @davidwendt
Refactor IndexedFrame.hash_values to use cudf::hashing functions, add xxhash64 to cudf Python. (#14538) @bdice
Move non-templated inline function definitions from table_view.hpp to table_view.cpp (#14535) @davidwendt
Add JNI for strings::code_points (#14533) @thirtiseven
Add a test for issue 12773 (#14529) @vyasr
Split libarrow build dependencies. (#14506) @bdice
Implement IndexedFrame.duplicated with distinct_indices + scatter (#14493) @wence-
Expunge as_frame conversions in Column algorithms (#14491) @wence-
Remove unsanitized null from input strings column in rank_tests.cpp (#14475) @davidwendt
Refactor Parquet kernel_error (#14464) @etseidl
Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
Introduce Comprehensive Pathological Unit Tests for Issue #14409 (#14459) @aocsa
Expose stream parameter in public nvtext APIs (#14456) @davidwendt
Include encode type in the error message when unsupported Parquet encoding is detected (#14453) @ZelboK
Remove null mask for zero nulls in json readers (#14451) @karthikeyann
Refactor cudf.Series.__init__ (#14450) @mroeschke
Remove the use of volatile in Parquet (#14448) @vuule
REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
Testing stream pool implementation (#14437) @shrshi
Match pandas join ordering obligations in pandas-compatible mode (#14428) @wence-
Forward-merge branch-23.12 to branch-24.02 (#14426) @bdice
Use isinstance(..., cudf.IntervalDtype) instead of is_interval_dtype (#14424) @mroeschke
Use isinstance(..., cudf.CategoricalDtype) instead of is_categorical_dtype (#14423) @mroeschke
Forward-merge branch-23.12 to branch-24.02 (#14422) @bdice
REF: Remove instances of pd.core (#14421) @mroeschke
Expose streams in public filling APIs for label_bins (#14401) @ZelboK
Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
Limit DELTA_BINARY_PACKED encoder to the same number of bits as the physical type being encoded (#14392) @etseidl
Add SHA-1 and SHA-2 hash functions. (#14391) @bdice
Expose streams in Parquet reader and writer APIs (#14359) @shrshi
Update to fmt 10.1.1 and spdlog 1.12.0. (#14355) @bdice
Replace default stream for scalars and column factories usages (because of defaulted arguments) (#14354) @karthikeyann
Expose streams in ORC reader and writer APIs (#14350) @shrshi
Convert compression and io to string axis type in IO benchmarks (#14347) @SurajAralihalli
Add cuDF devcontainers (#14015) @trxcllnt
Refactoring of Buffers (last step towards unifying COW and Spilling) (#13801) @madsbk
Switch to scikit-build-core (#13531) @vyasr
Simplify null count checking in column equality comparator (#13312) @vyasr

- C++
Published by raydouglass over 2 years ago

https://github.com/rapidsai/cudf - v24.02.00

🚨 Breaking Changes

Remove **kwargs from astype (#14765) @mroeschke
Remove mimesis as a testing dependency (#14723) @mroeschke
Update to Dask's shuffle_method kwarg (#14708) @pentschev
Drop Pascal GPU support. (#14630) @bdice
Update to CCCL 2.2.0. (#14576) @bdice
Expunge as_frame conversions in Column algorithms (#14491) @wence-
Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
Include writer code and writerVersion in ORC files (#14458) @vuule
Remove null mask for zero nulls in json readers (#14451) @karthikeyann
REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
Move chars column to parent data buffer in strings column (#14202) @karthikeyann
Switch to scikit-build-core (#13531) @vyasr

🐛 Bug Fixes

Exclude tests from builds (#14981) @vyasr
Fix the bounce buffer size in ORC writer (#14947) @vuule
Revert sum/product aggregation to always produce int64_t type (#14907) @SurajAralihalli
Fixed an issue with output chunking computation stemming from input chunking. (#14889) @nvdbaranec
Fix total_byte_size in Parquet row group metadata (#14802) @etseidl
Fix index difference to follow the pandas format (#14789) @amiralimi
Fix shared-workflows repo name (#14784) @raydouglass
Remove unparseable attributes from all nodes (#14780) @vyasr
Refactor and add validation to IntervalIndex.__init__ (#14778) @mroeschke
Work around incompatibilities between V2 page header handling and zStandard compression in Parquet writer (#14772) @etseidl
Fix calls to deprecated strings factory API (#14771) @davidwendt
Fix ptx file discovery in editable installs (#14767) @vyasr
Revise shuffle deprecation to align with dask/dask (#14762) @rjzamora
Enable intermediate proxies to be picklable (#14752) @shwina
Add CUDF_TEST_PROGRAM_MAIN macro to tests lacking it (#14751) @etseidl
Fix CMake args (#14746) @vyasr
Fix logic bug introduced in #14730 (#14742) @wence-
[Java] Choose The Correct RoundingMode For Checking Decimal OutOfBounds (#14731) @razajafri
Fix Groupby.get_group (#14728) @rjzamora
Ensure that all CUDA kernels in cudf have hidden visibility. (#14726) @robertmaynard
Split cuda versions for notebook testing (#14722) @raydouglass
Fix to_numeric not preserving Series index and name (#14718) @mroeschke
Update dask-cudf wheel name (#14713) @raydouglass
Fix strings::contains matching end of string target (#14711) @davidwendt
Update to Dask's shuffle_method kwarg (#14708) @pentschev
Write file-level statistics when writing ORC files with zero rows (#14707) @vuule
Potential fix for performance regression in #14415 (#14706) @etseidl
Ensure DataFrame column types are preserved during serialization (#14705) @mroeschke
Skip numba test that fails on ARM (#14702) @brandon-b-miller
Allow Z in datetime string parsing in non pandas compat mode (#14701) @mroeschke
Fix nan_as_null not being respected when passing arrow object (#14688) @mroeschke
Fix constructing Series/Index from arrow array and dtype (#14686) @mroeschke
Fix Aggregation Type Promotion: Ensure Unsigned Input Types Result in Unsigned Output for Sum and Multiply (#14679) @SurajAralihalli
Add BaseOffset as a final proxy type to pass instancechecks for offsets against BaseOffset (#14678) @shwina
Add row conversion code from spark-rapids-jni (#14664) @ttnghia
Unconditionally export the CCCL path (#14656) @vyasr
Ensure libcudf searches for our patched version of CCCL first (#14655) @robertmaynard
Constrain CUDA in notebook testing to prevent CUDA 12.1 usage until we have pynvjitlink (#14648) @vyasr
Fix invalid memory access in Parquet reader (#14637) @etseidl
Use column_empty over as_column([]) (#14632) @mroeschke
Add (implicit) handling for torch tensors in is_scalar (#14623) @wence-
Fix astype/fillna not maintaining column subclass and types (#14615) @mroeschke
Remove non-empty nulls in cudf::get_json_object (#14609) @davidwendt
Remove cuda::proclaim_return_type from nested lambda (#14607) @ttnghia
Fix DataFrame.reindex when column reindexing to MultiIndex/RangeIndex (#14605) @mroeschke
Address potential race conditions in Parquet reader (#14602) @etseidl
Fix DataFrame.reindex removing column name (#14601) @mroeschke
Remove unsanitized input test data from copy gtests (#14600) @davidwendt
Fix race detected in Parquet writer (#14598) @etseidl
Correct invalid or missing return types (#14587) @robertmaynard
Fix unsanitized nulls from strings segmented-reduce (#14586) @davidwendt
Upgrade to nvCOMP 3.0.5 (#14581) @davidwendt
Fix unsanitized nulls produced by cudf::clamp APIs (#14580) @davidwendt
Fix unsanitized nulls produced by libcudf dictionary decode (#14578) @davidwendt
Fixes a symbol group lookup table issue (#14561) @elstehle
Drop llvm16 from cuda118-conda devcontainer image (#14526) @charlesbluca
REF: Make DataFrame.from_pandas process by column (#14483) @mroeschke
Improve memory footprint of isin by using contains (#14478) @wence-
Move creation of env.yaml outside the current directory (#14476) @davidwendt
Enable pd.Timestamp objects to be picklable when cudf.pandas is active (#14474) @shwina
Correct dtype of count aggregations on empty dataframes (#14473) @wence-
Avoid DataFrame conversion in MultiIndex.from_pandas (#14470) @mroeschke
JSON writer: avoid default stream use in string_scalar constructors (#14444) @vuule
Fix default stream use in the CSV reader (#14443) @vuule
Preserve DataFrame(columns=).columns dtype during empty-like construction (#14381) @mroeschke
Defer PTX file load to runtime (#13690) @brandon-b-miller

📖 Documentation

Disable parallel build (#14796) @vyasr
Add pylibcudf to the docs (#14791) @vyasr
Describe unpickling expectations when cudf.pandas is enabled (#14693) @shwina
Update CONTRIBUTING for pyproject-only builds (#14653) @vyasr
More doxygen fixes (#14639) @vyasr
Enable doxygen XML generation and fix issues (#14477) @vyasr
Some doxygen improvements (#14469) @vyasr
Remove warning in dask-cudf docs (#14454) @wence-
Update README links with redirects. (#14378) @bdice
Add pip install instructions to README (#13677) @shwina

🚀 New Features

Add ci check for external kernels (#14768) @robertmaynard
JSON single quote normalization API (#14729) @shrshi
Write cuDF version in Parquet "created_by" metadata field (#14721) @etseidl
Implement remaining copying APIs in pylibcudf along with required helper functions (#14640) @vyasr
Don't constrain numba<0.58 (#14616) @brandon-b-miller
Add DELTA_LENGTH_BYTE_ARRAY encoder and decoder for Parquet (#14590) @etseidl
JSON - Parse mixed types as string in JSON reader (#14572) @karthikeyann
JSON quote normalization (#14545) @shrshi
Make DefaultHostMemoryAllocator settable (#14523) @gerashegalov
Implement more copying APIs in pylibcudf (#14508) @vyasr
Include writer code and writerVersion in ORC files (#14458) @vuule
Parquet sub-rowgroup reading. (#14360) @nvdbaranec
Move chars column to parent data buffer in strings column (#14202) @karthikeyann
PARQUET-2261 Size Statistics (#14000) @etseidl
Improve GroupBy JIT error handling (#13854) @brandon-b-miller
Generate unified Python/C++ docs (#13846) @vyasr
Expand JIT groupby test suite (#13813) @brandon-b-miller

🛠️ Improvements

Pin pytest<8 (#14920) @galipremsagar
Move cudf::char_utf8 definition from detail to public header (#14779) @davidwendt
Clean up TimedeltaIndex.__init__ constructor (#14775) @mroeschke
Clean up DatetimeIndex.__init__ constructor (#14774) @mroeschke
Some frame.py typing, move seldom used methods in frame.py (#14766) @mroeschke
Remove **kwargs from astype (#14765) @mroeschke
fix benchmarks compatibility with newer pytest-cases (#14764) @jameslamb
Add pynvjitlink as a dependency (#14763) @brandon-b-miller
Resolve degenerate performance in create_structs_data (#14761) @SurajAralihalli
Simplify ColumnAccessor methods; avoid unnecessary validations (#14758) @mroeschke
Pin pytest-cases<3.8.2 (#14756) @mroeschke
Use _from_data instead of _from_columns for initializing Frame (#14755) @mroeschke
Consolidate cudf object handling in as_column (#14754) @mroeschke
Reduce execution time of Parquet C++ tests (#14750) @vuule
Implement to_datetime(..., utc=True) (#14749) @mroeschke
Remove usages of rapids-env-update (#14748) @KyleFromNVIDIA
Provide explicit pool size and avoid RMM detail APIs (#14741) @harrism
Implement cudf.MultiIndex.from_arrays (#14740) @mroeschke
Remove unused/single use methods (#14739) @mroeschke
refactor CUDA versions in dependencies.yaml (#14733) @jameslamb
Remove unneeded methods in Column (#14730) @mroeschke
Clean up base column methods (#14725) @mroeschke
Ensure column.fillna signatures are consistent (#14724) @mroeschke
Remove mimesis as a testing dependency (#14723) @mroeschke
Replace as_numerical with as_numerical_column/codes (#14719) @mroeschke
Use offsetalator in gather_chars (#14700) @davidwendt
Use make_strings_children for fill() specialization logic (#14697) @davidwendt
Change io::detail::orc namespace into io::orc::detail (#14696) @ttnghia
Fix call to deprecated factory function (#14695) @davidwendt
Use as_column instead of arange for range like inputs (#14689) @mroeschke
Reorganize ORC reader into multiple files and perform some small fixes to cuIO code (#14665) @ttnghia
Split parquet test into multiple files (#14663) @etseidl
Custom error messages for IO with nonexistent files (#14662) @vuule
Explicitly pass .dtype into is_foo_dtype functions (#14657) @mroeschke
Basic validation in reader benchmarks (#14647) @vuule
Update dependencies.yaml to support CUDA 12.*. (#14644) @bdice
Consolidate memoryview handling in as_column (#14643) @mroeschke
Convert FieldType to scoped enum (#14642) @vuule
Use isinstance over is_foo_dtype (#14641) @mroeschke
Use isinstance over is_foo_dtype internally (#14638) @mroeschke
Remove unnecessary **kwargs in function signatures (#14635) @mroeschke
Drop nvbench patch for nvml. (#14631) @bdice
Drop Pascal GPU support. (#14630) @bdice
Add cpp/doxygen/xml to .gitignore (#14613) @davidwendt
Create strings-specific make_offsets_child_column for multiple offset types (#14612) @davidwendt
Use the offsetalator in cudf::concatenate for strings (#14611) @davidwendt
Make Parquet ColumnIndex null_counts optional (#14596) @etseidl
Support freq in DatetimeIndex (#14593) @shwina
Remove legacy benchmarks for cuDF-python (#14591) @osidekyle
Remove WORKSPACE env var from cudf_test temp_directory class (#14588) @davidwendt
Use exceptions instead of return values to handle errors in CompactProtocolReader (#14582) @vuule
Use cuda::proclaim_return_type on device lambdas. (#14577) @bdice
Update to CCCL 2.2.0. (#14576) @bdice
Update dependencies.yaml to new pip index (#14575) @vyasr
Simplify Python CMake (#14565) @vyasr
Java expose parquet pass_read_limit (#14564) @revans2
Add column sanitization checks in CUDF_TEST_EXPECT_COLUMN_* macros (#14559) @SurajAralihalli
Use cudf_test temp_directory class for nvtext::subword_tokenize gbenchmark (#14558) @davidwendt
Fix return type of prefix increment overloads (#14544) @vuule
Make bpe_merge_pairs_impl member private (#14543) @davidwendt
Small clean up in io::statistics (#14542) @vuule
Change json gtest environment variable to compile-time definition (#14541) @davidwendt
Remove extra total chars size calculation from cudf::concatenate (#14540) @davidwendt
Refactor IndexedFrame.hash_values to use cudf::hashing functions, add xxhash64 to cudf Python. (#14538) @bdice
Move non-templated inline function definitions from table_view.hpp to table_view.cpp (#14535) @davidwendt
Add JNI for strings::code_points (#14533) @thirtiseven
Add a test for issue 12773 (#14529) @vyasr
Split libarrow build dependencies. (#14506) @bdice
Implement IndexedFrame.duplicated with distinct_indices + scatter (#14493) @wence-
Expunge as_frame conversions in Column algorithms (#14491) @wence-
Remove unsanitized null from input strings column in rank_tests.cpp (#14475) @davidwendt
Refactor Parquet kernel_error (#14464) @etseidl
Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
Introduce Comprehensive Pathological Unit Tests for Issue #14409 (#14459) @aocsa
Expose stream parameter in public nvtext APIs (#14456) @davidwendt
Include encode type in the error message when unsupported Parquet encoding is detected (#14453) @ZelboK
Remove null mask for zero nulls in json readers (#14451) @karthikeyann
Refactor cudf.Series.__init__ (#14450) @mroeschke
Remove the use of volatile in Parquet (#14448) @vuule
REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
Testing stream pool implementation (#14437) @shrshi
Match pandas join ordering obligations in pandas-compatible mode (#14428) @wence-
Forward-merge branch-23.12 to branch-24.02 (#14426) @bdice
Use isinstance(..., cudf.IntervalDtype) instead of is_interval_dtype (#14424) @mroeschke
Use isinstance(..., cudf.CategoricalDtype) instead of is_categorical_dtype (#14423) @mroeschke
Forward-merge branch-23.12 to branch-24.02 (#14422) @bdice
REF: Remove instances of pd.core (#14421) @mroeschke
Expose streams in public filling APIs for label_bins (#14401) @ZelboK
Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
Limit DELTA_BINARY_PACKED encoder to the same number of bits as the physical type being encoded (#14392) @etseidl
Add SHA-1 and SHA-2 hash functions. (#14391) @bdice
Expose streams in Parquet reader and writer APIs (#14359) @shrshi
Update to fmt 10.1.1 and spdlog 1.12.0. (#14355) @bdice
Replace default stream for scalars and column factories usages (because of defaulted arguments) (#14354) @karthikeyann
Expose streams in ORC reader and writer APIs (#14350) @shrshi
Convert compression and io to string axis type in IO benchmarks (#14347) @SurajAralihalli
Add cuDF devcontainers (#14015) @trxcllnt
Refactoring of Buffers (last step towards unifying COW and Spilling) (#13801) @madsbk
Switch to scikit-build-core (#13531) @vyasr
Simplify null count checking in column equality comparator (#13312) @vyasr

- C++
Published by raydouglass over 2 years ago

https://github.com/rapidsai/cudf - [NIGHTLY] v24.04.00

🚨 Breaking Changes

Add future_stack to DataFrame.stack (#15015) @galipremsagar
Deprecate groupby fillna (#15000) @mroeschke
Deprecate replace with categorical columns (#14988) @mroeschke
Deprecate parameters similar to pandas 2.2 (#14984) @mroeschke
Add pandas-2.x support in cudf (#14916) @galipremsagar

🐛 Bug Fixes

Fix Index.difference to handle duplicate values when one of the inputs is empty (#15016) @galipremsagar
Add future_stack to DataFrame.stack (#15015) @galipremsagar
Fix handling of values=None in pylibcudf GroupBy.get_groups (#14998) @shwina
Fix DataFrame.sort_index to respect ignore_index on all axis (#14995) @galipremsagar
Raise for pyarrow array that is tz-aware (#14980) @mroeschke
Direct SeriesGroupBy.aggregate to SeriesGroupBy.agg (#14971) @rjzamora
unset CUDF_SPILL after a pytest (#14958) @galipremsagar
Fix dask token normalization (#14829) @rjzamora
Fix 24.04 versions (#14825) @raydouglass

📖 Documentation

[DOC] Update typo in docs example of structs_column_wrapper (#14949) @karthikeyann
Update cudf.pandas FAQ. (#14940) @bdice
Optimize doc builds (#14856) @vyasr
Add developer guideline to use east const. (#14836) @bdice
Notes convert to Pandas-compat (#12641) @Touutae-lab

🚀 New Features

Implement replace in pylibcudf (#15005) @vyasr
Implement rolling in pylibcudf (#14982) @vyasr
Implement joins in pylibcudf (#14972) @vyasr
Implement scans and reductions in pylibcudf (#14970) @vyasr
Rewrite cudf internals using pylibcudf groupby (#14946) @vyasr
Implement groupby in pylibcudf (#14945) @vyasr
POC for whitespace removal in input JSON data using FST (#14931) @shrshi
Remove supports_streams from cuDF custom memory resources. (#14857) @harrism
Migrate unary operations to pylibcudf (#14850) @vyasr
Migrate binary operations to pylibcudf (#14821) @vyasr
Add row index and stripe size options to Python ORC chunked writer (#14785) @vuule
Support CUDA 12.2 (#14712) @jameslamb

🛠️ Improvements

Change copy_if_safe to call thrust instead of the overload function (#15018) @davidwendt
Fix is_string_dtype test for pandas 2.2 (#15012) @mroeschke
Clean up detail sequence header inclusion (#15007) @PointKernel
Use offsetalator in cudf::strings::wrap() (#15002) @davidwendt
Use offsetalator in cudf::strings::reverse (#15001) @davidwendt
Deprecate groupby fillna (#15000) @mroeschke
Ensure to_* IO methods respect pandas 2.2 keyword only deprecation (#14999) @mroeschke
Filter all DeprecationWarning's by ArrowTable.to_pandas() (#14989) @galipremsagar
Deprecate replace with categorical columns (#14988) @mroeschke
Deprecate parameters similar to pandas 2.2 (#14984) @mroeschke
Ensure that ctest is called with --no-tests=error. (#14983) @bdice
Deprecate non-integer periods in date_range and interval_range (#14976) @galipremsagar
Use fused types for overloaded function signatures (#14969) @vyasr
Deprecate certain frequency strings (#14967) @galipremsagar
Update copyrights for 24.04. (#14964) @bdice
Introduce GetJsonObjectOptions in getJSONObject Java API (#14956) @SurajAralihalli
JNI JSON read with DataSource and inferred schema, along with basic java nested Schema JSON reads (#14954) @revans2
Make codecov only informational (always pass). (#14952) @bdice
Replace legacy cudf and dask_cudf imports as (d)gd (#14944) @mroeschke
Replace is_datetime64tz/interval_dtype with isinstance (#14943) @mroeschke
Update tests for pandas 2. (#14941) @bdice
Use more public pandas APIs (#14929) @mroeschke
Add pandas-2.x support in cudf (#14916) @galipremsagar
Use offsetalator in nvtext::byte_pair_encoding (#14888) @davidwendt
De-DOS line-endings (#14880) @wence-
Add detail cuco_allocator (#14877) @PointKernel
Move all core types to using enum class in Cython (#14876) @vyasr
Use int64 offset types for accessing code-points in nvtext::normalize (#14868) @davidwendt
Update conda-cpp-post-build-checks to branch-24.04. (#14854) @bdice
Remove deprecated strings functions (#14848) @davidwendt
Fix CI workflows for pandas-tests and add test summary. (#14847) @bdice
Use offsetalator in cudf::strings::copy_slice (#14844) @davidwendt
Fix V2 Parquet page alignment for use with zStandard compression (#14841) @etseidl
Fix calls to deprecated strings factory API in examples. (#14838) @bdice
Update pre-commit hooks (#14837) @bdice
Use rapids_cuda_set_runtime to determine cuda runtime usage by target (#14833) @vyasr
Remove get_mem_info functions from custom memory resources (#14832) @harrism
Fix debug build by splitting row_operator_tests_utilities.cu (#14826) @davidwendt
Remove -DNVBench_ENABLE_CUPTI=OFF. (#14820) @bdice
Branch 24.04 merge branch 24.02 (#14809) @vyasr
Branch 24.04 merge branch 24.02 (#14806) @vyasr
Use offsetalator in nvtext tokenize functions (#14783) @davidwendt
Reduce execution time of Python ORC tests (#14776) @vuule
Use offsetalator in cudf::strings::split functions (#14757) @davidwendt
Use offsetalator in cudf::strings::findall (#14745) @davidwendt
Use offsetalator in cudf::strings::url_decode (#14744) @davidwendt
Use get_offset_value utility in strings shift function (#14743) @davidwendt

- C++
Published by rapids-bot[bot] over 2 years ago

https://github.com/rapidsai/cudf - v23.12.01

🚨 Breaking Changes

Raise error in reindex when index is not unique (#14400) @galipremsagar
Expose stream parameter to get_json_object API (#14297) @davidwendt
Refactor cudf_kafka to use skbuild (#14292) @jdye64
Expose stream parameter in public strings convert APIs (#14255) @davidwendt
Upgrade to nvCOMP 3.0.4 (#13815) @vuule

🐛 Bug Fixes

Fix synchronization issue when writing string columns with dictionary to ORC (#14595) @vuule
Update actions/labeler to v4 (#14562) @raydouglass
Fix data corruption when skipping rows (#14557) @etseidl
Fix function name typo in cudf.pandas profiler (#14514) @galipremsagar
Fix intermediate type checking in expression parsing (#14445) @vyasr
Forward merge branch-23.10 into branch-23.12 (#14435) @raydouglass
Remove needs: wheel-build-cudf. (#14427) @bdice
Fix dask dependency in custreamz (#14420) @vyasr
Ensure nvbench initializes nvml context when built statically (#14411) @robertmaynard
Support java AST String literal with desired encoding (#14402) @winningsix
Raise error in reindex when index is not unique (#14400) @galipremsagar
Always build nvbench statically so we don't need to package it (#14399) @robertmaynard
Fix token-count logic in nvtext::tokenize_with_vocabulary (#14393) @davidwendt
Fix as_column(pd.Timestamp/Timedelta, length=) not respecting length (#14390) @mroeschke
cudf.pandas: cuDF subpath checking in module __getattr__ (#14388) @shwina
Fix and disable encoding for nanosecond statistics in ORC writer (#14367) @vuule
Add the new manylinux builds to the build job (#14351) @vyasr
cudf jit parser now supports .pragma instructions with quotes (#14348) @robertmaynard
Fix overflow check in cudf::merge (#14345) @divyegala
Add cramjam (#14344) @vyasr
Enable dask_cudf/io pytests in CI (#14338) @galipremsagar
Temporarily avoid the current build of pydata-sphinx-theme (#14332) @vyasr
Fix host buffer access from device function in the Parquet reader (#14328) @vuule
Run IO tests for Dask-cuDF (#14327) @rjzamora
Fix logical type issues in the Parquet writer (#14322) @vuule
Remove aws-sdk-pinning and revert to arrow 12.0.1 (#14319) @vyasr
test is_valid before reading column data (#14318) @etseidl
Fix gtest validity setting for TextTokenizeTest.Vocabulary (#14312) @davidwendt
Fixes stack context for json lines format that recovers from invalid JSON lines (#14309) @elstehle
Downgrade to Arrow 12.0.0 for aws-sdk-cpp and fix cudf_kafka builds for new CI containers (#14296) @vyasr
fixing thread index overflow issue (#14290) @hyperbolic2346
Fix memset error in nvtext::edit_distance_matrix (#14283) @davidwendt
Changes JSON reader's recovery option's behaviour to ignore all characters after a valid JSON record (#14279) @elstehle
Handle empty string correctly in Parquet statistics (#14257) @etseidl
Fixes behaviour for incomplete lines when recover_with_nulls is enabled (#14252) @elstehle
cudf::detail::pinned_allocator doesn't throw from deallocate (#14251) @robertmaynard
Fix strings replace for adjacent, identical multi-byte UTF-8 character targets (#14235) @davidwendt
Fix the precision when converting a decimal128 column to an arrow array (#14230) @jihoonson
Fixing parquet list of struct interpretation (#13715) @hyperbolic2346

📖 Documentation

Fix io reference in docs. (#14452) @bdice
Update README (#14374) @shwina
Example code for blog on new row comparators (#13795) @divyegala

🚀 New Features

Expose streams in public unary APIs (#14342) @vyasr
Add python tests for Parquet DELTA_BINARY_PACKED encoder (#14316) @etseidl
Update rapids-cmake functions to non-deprecated signatures (#14265) @robertmaynard
Expose streams in public null mask APIs (#14263) @vyasr
Expose streams in binaryop APIs (#14187) @vyasr
Add pylibcudf.Scalar that interoperates with Arrow scalars (#14133) @vyasr
Add decoder for DELTA_BYTE_ARRAY to Parquet reader (#14101) @etseidl
Add DELTA_BINARY_PACKED encoder for Parquet writer (#14100) @etseidl
Add BytePairEncoder class to cuDF (#13891) @davidwendt
Upgrade to nvCOMP 3.0.4 (#13815) @vuule
Use pynvjitlink for CUDA 12+ MVC (#13650) @brandon-b-miller

🛠️ Improvements

Build concurrency for nightly and merge triggers (#14441) @bdice
Cleanup remaining usages of dask dependencies (#14407) @galipremsagar
Update to Arrow 14.0.1. (#14387) @bdice
Remove Cython libcpp wrappers (#14382) @vyasr
Forward-merge branch-23.10 to branch-23.12 (#14372) @bdice
Upgrade to arrow 14 (#14371) @galipremsagar
Fix a pytest typo in test_kurt_skew_error (#14368) @galipremsagar
Use new rapids-dask-dependency metapackage for managing dask versions (#14364) @vyasr
Change nullable() to has_nulls() in cudf::detail::gather (#14363) @divyegala
Split up scan_inclusive.cu to improve its compile time (#14358) @davidwendt
Implement user_datasource_wrapper is_empty() and is_device_read_preferred(). (#14357) @tpn
Added streams to CSV reader and writer api (#14340) @shrshi
Upgrade wheels to use arrow 13 (#14339) @vyasr
Rework nvtext::byte_pair_encoding API (#14337) @davidwendt
Improve performance of nvtext::tokenize_with_vocabulary for long strings (#14336) @davidwendt
Upgrade arrow to 13 (#14330) @galipremsagar
Expose stream parameter in public nvtext replace APIs (#14329) @davidwendt
Drop pyorc dependency and use pandas/pyarrow instead (#14323) @galipremsagar
Avoid pyarrow.fs import for local storage (#14321) @rjzamora
Unpin dask and distributed for 23.12 development (#14320) @galipremsagar
Expose stream parameter in public nvtext tokenize APIs (#14317) @davidwendt
Added streams to JSON reader and writer api (#14313) @shrshi
Minor improvements in source_info (#14308) @vuule
Forward-merge branch-23.10 to branch-23.12 (#14307) @bdice
Add stream parameter to Set Operations (Public List APIs) (#14305) @SurajAralihalli
Expose stream parameter to get_json_object API (#14297) @davidwendt
Sort dictionary data alphabetically in the ORC writer (#14295) @vuule
Expose stream parameter in public strings filter APIs (#14293) @davidwendt
Refactor cudf_kafka to use skbuild (#14292) @jdye64
Update shared-action-workflows references (#14289) @AyodeAwe
Register partd encode dispatch in dask_cudf (#14287) @rjzamora
Update versioning strategy (#14285) @vyasr
Move and rename byte-pair-encoding source files (#14284) @davidwendt
Expose stream parameter in public strings combine APIs (#14281) @davidwendt
Expose stream parameter in public strings contains APIs (#14280) @davidwendt
Add stream parameter to List Sort and Filter APIs (#14272) @SurajAralihalli
Use branch-23.12 workflows. (#14271) @bdice
Refactor LogicalType for Parquet (#14264) @etseidl
Centralize chunked reading code in the parquet reader to reader_impl_chunking.cu (#14262) @nvdbaranec
Expose stream parameter in public strings replace APIs (#14261) @davidwendt
Expose stream parameter in public strings APIs (#14260) @davidwendt
Cleanup of namespaces in parquet code. (#14259) @nvdbaranec
Make parquet schema index type consistent (#14256) @hyperbolic2346
Expose stream parameter in public strings convert APIs (#14255) @davidwendt
Add in java bindings for DataSource (#14254) @revans2
Reimplement cudf::merge for nested types without using comparators (#14250) @divyegala
Add stream parameter to List Manipulation and Operations APIs (#14248) @SurajAralihalli
Expose stream parameter in public strings split/partition APIs (#14247) @davidwendt
Improve contains_column by invoking contains_table (#14238) @PointKernel
Detect and report errors in Parquet header parsing (#14237) @etseidl
Normalizing offsets iterator (#14234) @davidwendt
Forward merge 23.10 into 23.12 (#14231) @galipremsagar
Return error if BOOL8 column-type is used with integers-to-hex (#14208) @davidwendt
Enable indexalator for device code (#14206) @davidwendt
Marginally reduce memory footprint of joins (#14197) @wence-
Add nvtx annotations to spilling-based data movement (#14196) @wence-
Optimize ORC writer for decimal columns (#14190) @vuule
Remove the use of volatile in ORC (#14175) @vuule
Add bytes_per_second to distinct_count of stream_compaction nvbench. (#14172) @Blonck
Add bytes_per_second to transpose benchmark (#14170) @Blonck
cuDF: Build CUDA 12.0 ARM conda packages. (#14112) @bdice
Add bytes_per_second to shift benchmark (#13950) @Blonck
Extract debug_utilities.hpp/cu from column_utilities.hpp/cu (#13720) @ttnghia

- C++
Published by raydouglass over 2 years ago

https://github.com/rapidsai/cudf - v23.12.00

🚨 Breaking Changes

Raise error in reindex when index is not unique (#14400) @galipremsagar
Expose stream parameter to get_json_object API (#14297) @davidwendt
Refactor cudf_kafka to use skbuild (#14292) @jdye64
Expose stream parameter in public strings convert APIs (#14255) @davidwendt
Upgrade to nvCOMP 3.0.4 (#13815) @vuule

🐛 Bug Fixes

Update actions/labeler to v4 (#14562) @raydouglass
Fix data corruption when skipping rows (#14557) @etseidl
Fix function name typo in cudf.pandas profiler (#14514) @galipremsagar
Fix intermediate type checking in expression parsing (#14445) @vyasr
Forward merge branch-23.10 into branch-23.12 (#14435) @raydouglass
Remove needs: wheel-build-cudf. (#14427) @bdice
Fix dask dependency in custreamz (#14420) @vyasr
Ensure nvbench initializes nvml context when built statically (#14411) @robertmaynard
Support java AST String literal with desired encoding (#14402) @winningsix
Raise error in reindex when index is not unique (#14400) @galipremsagar
Always build nvbench statically so we don't need to package it (#14399) @robertmaynard
Fix token-count logic in nvtext::tokenize_with_vocabulary (#14393) @davidwendt
Fix as_column(pd.Timestamp/Timedelta, length=) not respecting length (#14390) @mroeschke
cudf.pandas: cuDF subpath checking in module __getattr__ (#14388) @shwina
Fix and disable encoding for nanosecond statistics in ORC writer (#14367) @vuule
Add the new manylinux builds to the build job (#14351) @vyasr
cudf jit parser now supports .pragma instructions with quotes (#14348) @robertmaynard
Fix overflow check in cudf::merge (#14345) @divyegala
Add cramjam (#14344) @vyasr
Enable dask_cudf/io pytests in CI (#14338) @galipremsagar
Temporarily avoid the current build of pydata-sphinx-theme (#14332) @vyasr
Fix host buffer access from device function in the Parquet reader (#14328) @vuule
Run IO tests for Dask-cuDF (#14327) @rjzamora
Fix logical type issues in the Parquet writer (#14322) @vuule
Remove aws-sdk-pinning and revert to arrow 12.0.1 (#14319) @vyasr
test is_valid before reading column data (#14318) @etseidl
Fix gtest validity setting for TextTokenizeTest.Vocabulary (#14312) @davidwendt
Fixes stack context for json lines format that recovers from invalid JSON lines (#14309) @elstehle
Downgrade to Arrow 12.0.0 for aws-sdk-cpp and fix cudf_kafka builds for new CI containers (#14296) @vyasr
fixing thread index overflow issue (#14290) @hyperbolic2346
Fix memset error in nvtext::edit_distance_matrix (#14283) @davidwendt
Changes JSON reader's recovery option's behaviour to ignore all characters after a valid JSON record (#14279) @elstehle
Handle empty string correctly in Parquet statistics (#14257) @etseidl
Fixes behaviour for incomplete lines when recover_with_nulls is enabled (#14252) @elstehle
cudf::detail::pinned_allocator doesn't throw from deallocate (#14251) @robertmaynard
Fix strings replace for adjacent, identical multi-byte UTF-8 character targets (#14235) @davidwendt
Fix the precision when converting a decimal128 column to an arrow array (#14230) @jihoonson
Fixing parquet list of struct interpretation (#13715) @hyperbolic2346

📖 Documentation

Fix io reference in docs. (#14452) @bdice
Update README (#14374) @shwina
Example code for blog on new row comparators (#13795) @divyegala

🚀 New Features

Expose streams in public unary APIs (#14342) @vyasr
Add python tests for Parquet DELTA_BINARY_PACKED encoder (#14316) @etseidl
Update rapids-cmake functions to non-deprecated signatures (#14265) @robertmaynard
Expose streams in public null mask APIs (#14263) @vyasr
Expose streams in binaryop APIs (#14187) @vyasr
Add pylibcudf.Scalar that interoperates with Arrow scalars (#14133) @vyasr
Add decoder for DELTA_BYTE_ARRAY to Parquet reader (#14101) @etseidl
Add DELTA_BINARY_PACKED encoder for Parquet writer (#14100) @etseidl
Add BytePairEncoder class to cuDF (#13891) @davidwendt
Upgrade to nvCOMP 3.0.4 (#13815) @vuule
Use pynvjitlink for CUDA 12+ MVC (#13650) @brandon-b-miller

🛠️ Improvements

Build concurrency for nightly and merge triggers (#14441) @bdice
Cleanup remaining usages of dask dependencies (#14407) @galipremsagar
Update to Arrow 14.0.1. (#14387) @bdice
Remove Cython libcpp wrappers (#14382) @vyasr
Forward-merge branch-23.10 to branch-23.12 (#14372) @bdice
Upgrade to arrow 14 (#14371) @galipremsagar
Fix a pytest typo in test_kurt_skew_error (#14368) @galipremsagar
Use new rapids-dask-dependency metapackage for managing dask versions (#14364) @vyasr
Change nullable() to has_nulls() in cudf::detail::gather (#14363) @divyegala
Split up scan_inclusive.cu to improve its compile time (#14358) @davidwendt
Implement user_datasource_wrapper is_empty() and is_device_read_preferred(). (#14357) @tpn
Added streams to CSV reader and writer api (#14340) @shrshi
Upgrade wheels to use arrow 13 (#14339) @vyasr
Rework nvtext::byte_pair_encoding API (#14337) @davidwendt
Improve performance of nvtext::tokenize_with_vocabulary for long strings (#14336) @davidwendt
Upgrade arrow to 13 (#14330) @galipremsagar
Expose stream parameter in public nvtext replace APIs (#14329) @davidwendt
Drop pyorc dependency and use pandas/pyarrow instead (#14323) @galipremsagar
Avoid pyarrow.fs import for local storage (#14321) @rjzamora
Unpin dask and distributed for 23.12 development (#14320) @galipremsagar
Expose stream parameter in public nvtext tokenize APIs (#14317) @davidwendt
Added streams to JSON reader and writer api (#14313) @shrshi
Minor improvements in source_info (#14308) @vuule
Forward-merge branch-23.10 to branch-23.12 (#14307) @bdice
Add stream parameter to Set Operations (Public List APIs) (#14305) @SurajAralihalli
Expose stream parameter to get_json_object API (#14297) @davidwendt
Sort dictionary data alphabetically in the ORC writer (#14295) @vuule
Expose stream parameter in public strings filter APIs (#14293) @davidwendt
Refactor cudf_kafka to use skbuild (#14292) @jdye64
Update shared-action-workflows references (#14289) @AyodeAwe
Register partd encode dispatch in dask_cudf (#14287) @rjzamora
Update versioning strategy (#14285) @vyasr
Move and rename byte-pair-encoding source files (#14284) @davidwendt
Expose stream parameter in public strings combine APIs (#14281) @davidwendt
Expose stream parameter in public strings contains APIs (#14280) @davidwendt
Add stream parameter to List Sort and Filter APIs (#14272) @SurajAralihalli
Use branch-23.12 workflows. (#14271) @bdice
Refactor LogicalType for Parquet (#14264) @etseidl
Centralize chunked reading code in the parquet reader to reader_impl_chunking.cu (#14262) @nvdbaranec
Expose stream parameter in public strings replace APIs (#14261) @davidwendt
Expose stream parameter in public strings APIs (#14260) @davidwendt
Cleanup of namespaces in parquet code. (#14259) @nvdbaranec
Make parquet schema index type consistent (#14256) @hyperbolic2346
Expose stream parameter in public strings convert APIs (#14255) @davidwendt
Add in java bindings for DataSource (#14254) @revans2
Reimplement cudf::merge for nested types without using comparators (#14250) @divyegala
Add stream parameter to List Manipulation and Operations APIs (#14248) @SurajAralihalli
Expose stream parameter in public strings split/partition APIs (#14247) @davidwendt
Improve contains_column by invoking contains_table (#14238) @PointKernel
Detect and report errors in Parquet header parsing (#14237) @etseidl
Normalizing offsets iterator (#14234) @davidwendt
Forward merge 23.10 into 23.12 (#14231) @galipremsagar
Return error if BOOL8 column-type is used with integers-to-hex (#14208) @davidwendt
Enable indexalator for device code (#14206) @davidwendt
Marginally reduce memory footprint of joins (#14197) @wence-
Add nvtx annotations to spilling-based data movement (#14196) @wence-
Optimize ORC writer for decimal columns (#14190) @vuule
Remove the use of volatile in ORC (#14175) @vuule
Add bytes_per_second to distinct_count of stream_compaction nvbench. (#14172) @Blonck
Add bytes_per_second to transpose benchmark (#14170) @Blonck
cuDF: Build CUDA 12.0 ARM conda packages. (#14112) @bdice
Add bytes_per_second to shift benchmark (#13950) @Blonck
Extract debug_utilities.hpp/cu from column_utilities.hpp/cu (#13720) @ttnghia

- C++
Published by raydouglass over 2 years ago

https://github.com/rapidsai/cudf - [NIGHTLY] v24.02.00

🚨 Breaking Changes

Expunge as_frame conversions in Column algorithms (#14491) @wence-
Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
Consolidate 1D pandas object handling in as_column (#14394) @mroeschke

🐛 Bug Fixes

Drop llvm16 from cuda118-conda devcontainer image (#14526) @charlesbluca
REF: Make DataFrame.from_pandas process by column (#14483) @mroeschke
Improve memory footprint of isin by using contains (#14478) @wence-
Enable pd.Timestamp objects to be picklable when cudf.pandas is active (#14474) @shwina
Correct dtype of count aggregations on empty dataframes (#14473) @wence-
Avoid DataFrame conversion in MultiIndex.from_pandas (#14470) @mroeschke
JSON writer: avoid default stream use in string_scalar constructors (#14444) @vuule
Fix default stream use in the CSV reader (#14443) @vuule
Preserve DataFrame(columns=).columns dtype during empty-like construction (#14381) @mroeschke

📖 Documentation

Some doxygen improvements (#14469) @vyasr
Remove warning in dask-cudf docs (#14454) @wence-
Update README links with redirects. (#14378) @bdice

🚀 New Features

Make DefaultHostMemoryAllocator settable (#14523) @gerashegalov

🛠️ Improvements

Split libarrow build dependencies. (#14506) @bdice
Expunge as_frame conversions in Column algorithms (#14491) @wence-
Remove unsanitized null from input strings column in rank_tests.cpp (#14475) @davidwendt
Refactor Parquet kernel_error (#14464) @etseidl
Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
Introduce Comprehensive Pathological Unit Tests for Issue #14409 (#14459) @aocsa
Expose stream parameter in public nvtext APIs (#14456) @davidwendt
Remove the use of volatile in Parquet (#14448) @vuule
REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
Match pandas join ordering obligations in pandas-compatible mode (#14428) @wence-
Forward-merge branch-23.12 to branch-24.02 (#14426) @bdice
Forward-merge branch-23.12 to branch-24.02 (#14422) @bdice
REF: Remove instances of pd.core (#14421) @mroeschke
Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
Update to fmt 10.1.1 and spdlog 1.12.0. (#14355) @bdice
Add cuDF devcontainers (#14015) @trxcllnt

- C++
Published by rapids-bot[bot] over 2 years ago

https://github.com/rapidsai/cudf - v23.10.02

🚨 Breaking Changes

Raise error in reindex when index is not unique (#14429) @galipremsagar
Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
Create table_input_metadata from a table_metadata (#13920) @etseidl
Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
Remove the libcudf cudf::offset_type type (#13788) @davidwendt
Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
Update to Cython 3.0.0 (#13777) @vyasr
Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
Enforce deprecations in 23.10 (#13732) @galipremsagar
Upgrade to arrow 12 (#13728) @galipremsagar
Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule

🐛 Bug Fixes

Raise error in reindex when index is not unique (#14429) @galipremsagar
Fix inaccurate ceil/floor and inaccurate rescaling casts of fixed-point values. (#14242) @bdice
Fix inaccuracy in decimal128 rounding. (#14233) @bdice
Workaround for illegal instruction error in sm90 for warp intrinsics with mask (#14201) @karthikeyann
Fix pytorch related pytest (#14198) @galipremsagar
Pin to aws-sdk-cpp<1.11 (#14173) @pentschev
Fix assert failure for range window functions (#14168) @mythrocks
Fix Memcheck error found in JSON_TEST JsonReaderTest.ErrorStrings (#14164) @karthikeyann
Fix calls to copy_bitmask to pass stream parameter (#14158) @davidwendt
Fix DataFrame from Series with different CategoricalIndexes (#14157) @mroeschke
Pin to numpy<1.25 and numba<0.58 to avoid errors and deprecation warnings-as-errors. (#14156) @bdice
Fix kernel launch error for cudf::io::orc::gpu::rowgroup_char_counts_kernel (#14139) @davidwendt
Don't sort columns for DataFrame init from list of Series (#14136) @mroeschke
Fix DataFrame.values with no columns but index (#14134) @mroeschke
Avoid circular cimports in _lib/cpp/reduce.pxd (#14125) @vyasr
Add support for nested dict in DataFrame constructor (#14119) @galipremsagar
Restrict iterables of DataFrame's as input to DataFrame constructor (#14118) @galipremsagar
Allow numeric_only=True for reduction operations on numeric types (#14111) @galipremsagar
Preserve name of the column while initializing a DataFrame (#14110) @galipremsagar
Correct numerous 20054-D: dynamic initialization errors found on arm+12.2 (#14108) @robertmaynard
Drop kwargs from Series.count (#14106) @galipremsagar
Fix naming issues with Index.to_frame and MultiIndex.to_frame APIs (#14105) @galipremsagar
Only use memory resources that haven't been freed (#14103) @robertmaynard
Add support for __round__ in Series and DataFrame (#14099) @galipremsagar
Validate ignore_index type in drop_duplicates (#14098) @mroeschke
Fix renaming Series and Index (#14080) @galipremsagar
Raise NotImplementedError in to_datetime if Z (or tz component) in string (#14074) @mroeschke
Raise NotImplementedError for datetime strings with UTC offset (#14070) @mroeschke
Update pyarrow-related dispatch logic in dask_cudf (#14069) @rjzamora
Use conda mambabuild rather than mamba mambabuild (#14067) @wence-
Raise NotImplementedError in to_datetime with dayfirst without infer_format (#14058) @mroeschke
Fix various issues in Index.intersection (#14054) @galipremsagar
Fix Index.difference to match with pandas (#14053) @galipremsagar
Fix empty string column construction (#14052) @galipremsagar
Fix IntervalIndex.union to preserve type-metadata (#14051) @galipremsagar
Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
Ignore compile_commands.json (#14048) @harrism
Raise TypeError for any non-parseable argument in to_datetime (#14044) @mroeschke
Raise NotImplementedError for to_datetime with z format (#14037) @mroeschke
Implement sort_remaining for sort_index (#14033) @wence-
Raise NotImplementedError for Categoricals with timezones (#14032) @mroeschke
Temporary fix Parquet metadata with empty value string being ignored from writing (#14026) @ttnghia
Preserve types of scalar being returned when possible in quantile (#14014) @galipremsagar
Fix return type of MultiIndex.difference (#14009) @galipremsagar
Raise an error when timezone subtypes are encountered in pd.IntervalDtype (#14006) @galipremsagar
Fix map column can not be non-nullable for java (#14003) @res-life
Fix name selection in Index.difference and Index.intersection (#13986) @galipremsagar
Restore column type metadata with dropna to fix factorize API (#13980) @galipremsagar
Use thread_index_type to avoid out of bounds accesses in conditional joins (#13971) @vyasr
Fix MultiIndex.to_numpy to return numpy array with tuples (#13966) @galipremsagar
Use cudf::thread_index_type in get_json_object and tdigest kernels (#13962) @nvdbaranec
Fix an issue with IntervalIndex.repr when null values are present (#13958) @galipremsagar
Fix type metadata issue preservation with Column.unique (#13957) @galipremsagar
Handle Interval scalars when passed in list-like inputs to cudf.Index (#13956) @galipremsagar
Fix setting of categories order when dtype is passed to a CategoricalColumn (#13955) @galipremsagar
Handle as_index in GroupBy.apply (#13951) @brandon-b-miller
Raise error for string types in nsmallest and nlargest (#13946) @galipremsagar
Fix index of Groupby.apply results when it is performed on empty objects (#13944) @galipremsagar
Fix integer overflow in shim device_sum functions (#13943) @brandon-b-miller
Fix type mismatch in groupby reduction for empty objects (#13942) @galipremsagar
Fixed processed bytes calculation in APPLY_BOOLEAN_MASK benchmark. (#13937) @Blonck
Fix construction of Grouping objects (#13932) @galipremsagar
Fix an issue with loc when column names is MultiIndex (#13929) @galipremsagar
Fix handling of typecasting in searchsorted (#13925) @galipremsagar
Preserve index name in reindex (#13917) @galipremsagar
Use cudf::thread_index_type in cuIO to prevent overflow in row indexing (#13910) @vuule
Fix for encodings listed in the Parquet column chunk metadata (#13907) @etseidl
Use cudf::thread_index_type in concatenate.cu. (#13906) @bdice
Use cudf::thread_index_type in replace.cu. (#13905) @bdice
Add noSanitizer tag to Java reduction tests failing with sanitizer in CUDA 12 (#13904) @jlowe
Remove the internal use of the cudf's default stream in cuIO (#13903) @vuule
Use cuda-nvtx-dev CUDA 12 package. (#13901) @bdice
Use thread_index_type to avoid index overflow in grid-stride loops (#13895) @PointKernel
Fix memory access error in cudf::shift for sliced strings (#13894) @davidwendt
Raise error when trying to construct a DataFrame with mixed types (#13889) @galipremsagar
Return nan when one variable to be correlated has zero variance in JIT GroupBy Apply (#13884) @brandon-b-miller
Correctly detect the BOM mark in read_csv with compressed input (#13881) @vuule
Check for the presence of all values in MultiIndex.isin (#13879) @galipremsagar
Fix nvtext::generate_character_ngrams performance regression for longer strings (#13874) @davidwendt
Fix return type of MultiIndex.levels (#13870) @galipremsagar
Fix List's missing children metadata in JSON writer (#13869) @karthikeyann
Disable construction of Index when freq is set in pandas-compatibility mode (#13857) @galipremsagar
Fix an issue with fetching NA from a TimedeltaColumn (#13853) @galipremsagar
Simplify implementation of interval_range() and fix behaviour for floating freq (#13844) @shwina
Fix binary operations between Series and Index (#13842) @galipremsagar
Update make_lists_column_from_scalar to use make_offsets_child_column utility (#13841) @davidwendt
Fix read out of bounds in string concatenate (#13838) @pentschev
Raise error for more cases when timezone-aware data is passed to as_column (#13835) @galipremsagar
Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
Raise error when trying to construct time-zone aware timestamps (#13830) @galipremsagar
Fix cuFile I/O factories (#13829) @vuule
DataFrame with namedtuples uses ._field as column names (#13824) @mroeschke
Branch 23.10 merge 23.08 (#13822) @vyasr
Return a Series from JIT GroupBy apply, rather than a DataFrame (#13820) @brandon-b-miller
No need to dlsym EnsureS3Finalized we can call it directly (#13819) @robertmaynard
Raise error when mixed types are being constructed (#13816) @galipremsagar
Fix unbounded sequence issue in DataFrame constructor (#13811) @galipremsagar
Fix Byte-Pair-Encoding usage of cuco static-map for storing merge-pairs (#13807) @davidwendt
Fix for Parquet writer when requested pages per row is smaller than fragment size (#13806) @etseidl
Remove hangs from trying to construct un-bounded sequences (#13799) @galipremsagar
Bug/update libcudf to handle arrow12 changes (#13794) @robertmaynard
Update get_arrow to arrows 12 CMake target name of arrow::xsimd (#13790) @robertmaynard
Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
Fix negative unary operation for boolean type (#13780) @galipremsagar
Fix contains(in) method for Series (#13779) @galipremsagar
Fix binary operation column ordering and missing column issues (#13778) @galipremsagar
Cast only time of day to nanos to avoid an overflow in Parquet INT96 write (#13776) @gerashegalov
Preserve names of column object in various APIs (#13772) @galipremsagar
Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
Fix construction of DataFrames from dict when columns are provided (#13766) @wence-
Provide our own Cython declaration for make_unique (#13746) @wence-

📖 Documentation

Fix benchmark image. (#14376) @bdice
Fix typo in docstring: metadata. (#14025) @bdice
Fix typo in parquet/page_decode.cuh (#13849) @XinyuZeng
Simplify Python doc configuration (#13826) @vyasr
Update documentation to reflect recent changes in JSON reader and writer (#13791) @vuule
Fix all warnings in Python docs (#13789) @vyasr

🚀 New Features

[Java] Add JNI bindings for integers_to_hex (#14205) @razajafri
Propagate errors from Parquet reader kernels back to host (#14167) @vuule
JNI for HISTOGRAM and MERGE_HISTOGRAM aggregations (#14154) @ttnghia
Expose streams in all public sorting APIs (#14146) @vyasr
Enable direct ingestion and production of Arrow scalars (#14121) @vyasr
Implement GroupBy.value_counts to match pandas API (#14114) @stmio
Refactor parquet thrift reader (#14097) @etseidl
Refactor hash_reduce_by_row (#14095) @ttnghia
Support negative preceding/following for ROW window functions (#14093) @mythrocks
Support for progressive parquet chunked reading. (#14079) @nvdbaranec
Implement HISTOGRAM and MERGE_HISTOGRAM aggregations (#14045) @ttnghia
Expose streams in public search APIs (#14034) @vyasr
Expose streams in public replace APIs (#14010) @vyasr
Add stream parameter to public cudf::strings::split APIs (#13997) @davidwendt
Expose streams in public filling APIs (#13990) @vyasr
Expose streams in public concatenate APIs (#13987) @vyasr
Use HostMemoryAllocator in jni::allocate_host_buffer (#13975) @gerashegalov
Enable fractional null probability for hashing benchmark (#13967) @Blonck
Switch pylibcudf-enabled types to use enum class in Cython (#13931) @vyasr
Add nvtext::tokenize_with_vocabulary API (#13930) @davidwendt
Rewrite DataFrame.stack to support multi level column names (#13927) @isVoid
Add HostMemoryAllocator interface (#13924) @gerashegalov
Global stream pool (#13922) @etseidl
Create table_input_metadata from a table_metadata (#13920) @etseidl
Translate column size overflow exception to JNI (#13911) @mythrocks
Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
Exclude some tests from running with the compute sanitizer (#13872) @firestarman
Expand statistics support in ORC writer (#13848) @vuule
Register the memory mapped buffer in datasource to improve H2D throughput (#13814) @vuule
Add cudf::strings::find function with target per row (#13808) @davidwendt
Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
Remove unnecessary pointer copying in JIT GroupBy Apply (#13792) @brandon-b-miller
Add 'poll' function to custreamz kafka consumer (#13782) @jdye64
Support corr in GroupBy.apply through the jit engine (#13767) @shwina
Optionally write version 2 page headers in Parquet writer (#13751) @etseidl
Support more numeric types in Groupby.apply with engine='jit' (#13729) @brandon-b-miller
[FEA] Add DELTA_BINARY_PACKED decoding support to Parquet reader (#13637) @etseidl
Read FIXED_LEN_BYTE_ARRAY as binary in parquet reader (#13437) @PointKernel

🛠️ Improvements

Update shared-action-workflows references (backport from 23.12 to 23.10) (#14300) @AyodeAwe
Pin dask and distributed for 23.10 release (#14225) @galipremsagar
update rmm tag path (#14195) @AyodeAwe
Disable Recently Updated Check (#14193) @ajschmidt8
Move cpp/src/hash/hash_allocator.cuh to include/cudf/hashing/detail (#14163) @davidwendt
Add Parquet reader benchmarks for row selection (#14147) @vuule
Update image names (#14145) @AyodeAwe
Support callables in DataFrame.assign (#14142) @wence-
Reduce memory usage of as_categorical_column (#14138) @wence-
Replace Python scalar conversions with libcudf (#14124) @vyasr
Update to clang 16.0.6. (#14120) @bdice
Fix type of empty Index and raise warning in Series constructor (#14116) @galipremsagar
Add stream parameter to external dict APIs (#14115) @SurajAralihalli
Add fallback matrix for nvcomp. (#14082) @bdice
[Java] Add recoverWithNull to JSONOptions and pass to Table.readJSON (#14078) @andygrove
Remove header tests (#14072) @ajschmidt8
Refactor contains_table with cuco::static_set (#14064) @PointKernel
Remove debug print in a Parquet test (#14063) @vuule
Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
Expose stream parameter in public strings find APIs (#14060) @davidwendt
Update doxygen to 1.9.1 (#14059) @vyasr
Remove the mr from the base fixture (#14057) @vyasr
Expose streams in public strings case APIs (#14056) @davidwendt
Refactor libcudf indexalator to typed normalator (#14043) @davidwendt
Use cudf::make_empty_column instead of column_view constructor (#14030) @davidwendt
Remove quadratic runtime due to accessing Frame._dtypes in loop (#14028) @wence-
Explicitly depend on zlib in conda recipes (#14018) @wence-
Use grid_stride for stride computations. (#13996) @bdice
Fix an issue where casting null-array to object dtype will result in a failure (#13994) @galipremsagar
Add tab as literal to cudf::test::to_string output (#13993) @davidwendt
Enable codes dtype parity in pandas-compatibility mode for factorize API (#13982) @galipremsagar
Fix CategoricalIndex ordering in Groupby.agg when pandas-compatibility mode is enabled (#13978) @galipremsagar
Produce a fatal error if cudf is unable to find pyarrow include directory (#13976) @cwharris
Use thread_index_type in partitioning.cu (#13973) @divyegala
Use cudf::thread_index_type in merge.cu (#13972) @divyegala
Use copy-pr-bot (#13970) @ajschmidt8
Use cudf::thread_index_type in strings custom kernels (#13968) @davidwendt
Add bytes_per_second to hash_partition benchmark (#13965) @Blonck
Added pinned pool reservation API for java (#13964) @revans2
Simplify wheel build scripts and allow alphas of RAPIDS dependencies (#13963) @vyasr
Add bytes_per_second to copy_if_else benchmark (#13960) @Blonck
Add pandas compatible output to Series.unique (#13959) @galipremsagar
Add bytes_per_second to compiled binaryop benchmark (#13938) @Blonck
Unpin dask and distributed for 23.10 development (#13935) @galipremsagar
Make HostColumnVector.getRefCount public (#13934) @abellina
Use cuco::static_set in JSON tree algorithm (#13928) @karthikeyann
Add java API to get size of host memory needed to copy column view (#13919) @revans2
Use cudf::size_type instead of int32 where appropriate in nvtext functions (#13915) @davidwendt
Enable hugepage for arrow host allocations (#13914) @madsbk
Improve performance of nvtext::edit_distance (#13912) @davidwendt
Ensure cudf internals use pylibcudf in pure Python mode (#13909) @vyasr
Use empty() instead of size() where possible (#13908) @vuule
[JNI] Adds HostColumnVector.EventHandler for spillability checks (#13898) @abellina
Return Timestamp & Timedelta for fetching scalars in DatetimeIndex & TimedeltaIndex (#13896) @galipremsagar
Allow explicit shuffle="p2p" within dask-cudf API (#13893) @rjzamora
Disable creation of DatetimeIndex when freq is passed to cudf.date_range (#13890) @galipremsagar
Bring parity with pandas for datetime & timedelta comparison operations (#13877) @galipremsagar
Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
Raise error when astype(object) is called in pandas compatibility mode (#13862) @galipremsagar
Fixes a performance regression in FST (#13850) @elstehle
Set native handles to null on close in Java wrapper classes (#13818) @jlowe
Avoid use of CUDF_EXPECTS in libcudf unit tests outside of helper functions with return values (#13812) @vuule
Update lists::contains to experimental row comparator (#13810) @divyegala
Reduce lists::contains dispatches for scalars (#13805) @divyegala
Long string optimization for string column parsing in JSON reader (#13803) @karthikeyann
Raise NotImplementedError for pd.SparseDtype (#13798) @mroeschke
Remove the libcudf cudf::offset_type type (#13788) @davidwendt
Move Spark-independent Table debug to cudf Java (#13783) @gerashegalov
Update to Cython 3.0.0 (#13777) @vyasr
Refactor Parquet reader handling of V2 page header info (#13775) @etseidl
Branch 23.10 merge 23.08 (#13773) @vyasr
Restructure JSON code to correctly reflect legacy/experimental status (#13757) @vuule
Branch 23.10 merge 23.08 (#13753) @vyasr
Enforce deprecations in 23.10 (#13732) @galipremsagar
Upgrade to arrow 12 (#13728) @galipremsagar
Refactors JSON reader's pushdown automaton (#13716) @elstehle
Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule

- C++
Published by raydouglass over 2 years ago

https://github.com/rapidsai/cudf - v23.04.01

🚨 Breaking Changes

Pin dask and distributed for release (#13070) @galipremsagar
Declare a different name for nan_equality.UNEQUAL to prevent Cython warnings. (#12947) @bdice
Update minimum pandas and numpy pinnings (#12887) @galipremsagar
Deprecate names & dtype in Index.copy (#12825) @galipremsagar
Deprecate Index.is_* methods (#12820) @galipremsagar
Deprecate datetime_is_numeric from describe (#12818) @galipremsagar
Deprecate na_sentinel in factorize (#12817) @galipremsagar
Make string methods return a Series with a useful Index (#12814) @shwina
Produce useful guidance on overflow error in to_csv (#12705) @wence-
Move strings_udf code into cuDF (#12669) @brandon-b-miller
Remove cudf::strings::repeat_strings_output_sizes and optional parameter from cudf::strings::repeat_strings (#12609) @davidwendt
Replace message parsing with throwing more specific exceptions (#12426) @vyasr

🐛 Bug Fixes

Pin curand version (#13127) @vyasr
Fix memcheck script to execute only _TEST files found in bin/gtests/libcudf (#13006) @davidwendt
Fix DataFrame constructor to broadcast scalar inputs properly (#12997) @galipremsagar
Drop force_nullable_schema from chunked parquet writer (#12996) @galipremsagar
Fix gtest column utility comparator diff reporting (#12995) @davidwendt
Handle index names while performing groupby (#12992) @galipremsagar
Fix __setitem__ on string columns when the scalar value ends in a null byte (#12991) @wence-
Fix sort_values when column is all empty strings (#12988) @eriknw
Remove unused variable and fix memory issue in ORC writer (#12984) @ttnghia
Pre-emptive fix for upstream dask.dataframe.read_parquet changes (#12983) @rjzamora
Remove MANIFEST.in use auto-generated one for sdists and package_data for wheels (#12960) @vyasr
Update to use rapids-export(COMPONENTS) feature. (#12959) @robertmaynard
cudftestutil supports static gtest dependencies (#12957) @robertmaynard
Include gtest in build environment. (#12956) @vyasr
Correctly handle scalar indices in Index.__getitem__ (#12955) @wence-
Avoid building cython twice (#12945) @galipremsagar
Fix set index error for Series rolling window operations (#12942) @galipremsagar
Fix calculation of null counts for Parquet statistics (#12938) @etseidl
Preserve integer dtype of hive-partitioned column containing nulls (#12930) @rjzamora
Use get_current_device_resource for intermediate allocations in COLLECT_LIST window code (#12927) @karthikeyann
Mark dlpack tensor deleter as noexcept to match PyCapsule_Destructor signature. (#12921) @bdice
Fix conda recipe post-link.sh typo (#12916) @pentschev
min_rows and num_rows are swapped in ComputePageSizes declaration in Parquet reader (#12886) @etseidl
Expect cupy to now support bool arrays for dlpack. (#12883) @vyasr
Use python -m pytest for nightly wheel tests (#12871) @bdice
Parquet writer column_size() should return a size_t (#12870) @etseidl
Fix cudf::hash_partition kernel launch error with decimal128 types (#12863) @davidwendt
Fix an issue with parquet chunked reader undercounting string lengths. (#12859) @nvdbaranec
Remove tokenizers pre-install pinning. (#12854) @vyasr
Fix parquet RangeIndex bug (#12838) @rjzamora
Remove KAFKA_HOST_TEST from compute-sanitizer check (#12831) @davidwendt
Make string methods return a Series with a useful Index (#12814) @shwina
Tell cudf_kafka to use header-only fmt (#12796) @vyasr
Add GroupBy.dtypes (#12783) @galipremsagar
Fix a leak in a test and clarify some test names (#12781) @revans2
Fix bug in all-null list due to join_list_elements special handling (#12767) @karthikeyann
Add try/except for expected null-schema error in read_parquet (#12756) @rjzamora
Throw an exception if an unsupported page encoding is detected in Parquet reader (#12754) @etseidl
Fix a bug with num_keys in _scatter_by_slice (#12749) @thomcom
Bump pinned rapids wheel deps to 23.4 (#12735) @sevagh
Rework logic in cudf::strings::split_record to improve performance (#12729) @davidwendt
Add always_nullable flag to Dremel encoding (#12727) @divyegala
Fix memcheck read error in compound segmented reduce (#12722) @davidwendt
Fix faulty conditional logic in JIT GroupBy.apply (#12706) @brandon-b-miller
Produce useful guidance on overflow error in to_csv (#12705) @wence-
Handle parquet list data corner case (#12698) @nvdbaranec
Fix missing trailing comma in json writer (#12688) @karthikeyann
Remove child from newCudaAsyncMemoryResource (#12681) @abellina
Handle bool types in round API (#12670) @galipremsagar
Ensure all of device bitmask is initialized in from_arrow (#12668) @wence-
Fix from_arrow to load a sliced arrow table (#12665) @galipremsagar
Fix dask-cudf read_parquet bug for multi-file aggregation (#12663) @rjzamora
Fix AllocateLikeTest gtests reading uninitialized null-mask (#12643) @davidwendt
Fix find_common_dtype and values to handle complex dtypes (#12537) @galipremsagar
Fix fetching of MultiIndex values when a label is passed (#12521) @galipremsagar
Fix Series comparison vs scalars (#12519) @brandon-b-miller
Allow casting from UDFString back to StringView to call methods in strings_udf (#12363) @brandon-b-miller

📖 Documentation

Fix GroupBy.apply doc examples rendering (#12994) @brandon-b-miller
add sphinx building and s3 uploading for dask-cudf docs (#12982) @quasiben
Add developer documentation forbidding default parameters in detail APIs (#12978) @vyasr
Add README symlink for dask-cudf. (#12946) @bdice
Remove return type from @return doxygen tags (#12908) @davidwendt
Fix docs build to be pydata-sphinx-theme=0.13.0 compatible (#12874) @galipremsagar
Add skeleton API and prose documentation for dask-cudf (#12725) @wence-
Enable doctests for GroupBy methods (#12658) @brandon-b-miller
Add comment about CUB patch for SegmentedSortInt.Bool gtest (#12611) @davidwendt

🚀 New Features

Add JNI method for strings::replace multi variety (#12979) @NVnavkumar
Add nunique aggregation support for cudf::segmented_reduce (#12972) @davidwendt
Refactor orc chunked writer (#12949) @ttnghia
Make Parquet writer nullable option application to single table writes (#12933) @vuule
Refactor io::orc::ProtobufWriter (#12877) @ttnghia
Make timezone table independent from ORC (#12805) @vuule
Cache JIT GroupBy.apply functions (#12802) @brandon-b-miller
Implement initial support for avro logical types (#6482) (#12788) @tpn
Update tests/column_utilities to use experimental::equality row comparator (#12777) @divyegala
Update distinct/unique_count to experimental::row hasher/comparator (#12776) @divyegala
Update hash_partition to use experimental::row::row_hasher (#12761) @divyegala
Update is_sorted to use experimental::row::lexicographic (#12752) @divyegala
Update default data source in cuio reader benchmarks (#12740) @PointKernel
Reenable stream identification library in CI (#12714) @vyasr
Add regex_program strings splitting java APIs and tests (#12713) @cindyyuanjiang
Add regex_program strings replacing java APIs and tests (#12701) @cindyyuanjiang
Add regex_program strings extract java APIs and tests (#12699) @cindyyuanjiang
Variable fragment sizes for Parquet writer (#12685) @etseidl
Add segmented reduction support for fixed-point types (#12680) @davidwendt
Move strings_udf code into cuDF (#12669) @brandon-b-miller
Add regex_program searching APIs and related java classes (#12666) @cindyyuanjiang
Add logging to libcudf (#12637) @vuule
Add compound aggregations to cudf::segmented_reduce (#12573) @davidwendt
Convert rank to use experimental row comparators (#12481) @divyegala
Use rapids-cmake parallel testing feature (#12451) @robertmaynard
Enable detection of undesired stream usage (#12089) @vyasr

🛠️ Improvements

Pin dask and distributed for release (#13070) @galipremsagar
Pin cupy in wheel tests to supported versions (#13041) @vyasr
Pin numba version (#13001) @vyasr
Rework gtests SequenceTest to remove using namespace cudf (#12985) @davidwendt
Stop setting package version attribute in wheels (#12977) @vyasr
Move detail reduction functions to cudf::reduction::detail namespace (#12971) @davidwendt
Remove default detail mrs: part7 (#12970) @vyasr
Remove default detail mrs: part6 (#12969) @vyasr
Remove default detail mrs: part5 (#12968) @vyasr
Remove default detail mrs: part4 (#12967) @vyasr
Remove default detail mrs: part3 (#12966) @vyasr
Remove default detail mrs: part2 (#12965) @vyasr
Remove default detail mrs: part1 (#12964) @vyasr
Add force_nullable_schema parameter to Parquet writer. (#12952) @galipremsagar
Declare a different name for nan_equality.UNEQUAL to prevent Cython warnings. (#12947) @bdice
Remove remaining default stream parameters (#12943) @vyasr
Fix cudf::segmented_reduce gtest for ANY aggregation (#12940) @davidwendt
Implement groupby.head and groupby.tail (#12939) @wence-
Fix libcudf gtests to pass null-count=0 for empty validity masks (#12923) @davidwendt
Migrate parquet encoding to use experimental row operators (#12918) @PointKernel
Fix benchmarks coded in namespace cudf and using namespace cudf (#12915) @karthikeyann
Fix io/text gtests coded in namespace cudf::test (#12914) @karthikeyann
Pass SCCACHE_S3_USE_SSL to conda builds (#12910) @ajschmidt8
Fix FST, JSON gtests & benchmarks coded in namespace cudf::test (#12907) @karthikeyann
Generate pyproject dependencies using dfg (#12906) @vyasr
Update libcudf counting functions to specify cudf::size_type (#12904) @davidwendt
Fix moto env vars & pass AWS_SESSION_TOKEN to conda builds (#12902) @ajschmidt8
Rewrite CSV writer benchmark with nvbench (#12901) @PointKernel
Rework some code logic to reduce iterator and comparator inlining to improve compile time (#12900) @davidwendt
Deprecate line_terminator in favor of lineterminator in to_csv (#12896) @wence-
Add stream and mr parameters for structs::detail::flatten_nested_columns (#12892) @ttnghia
Deprecate libcudf regex APIs accepting pattern strings directly (#12891) @davidwendt
Remove default parameters from detail headers in include (#12888) @vyasr
Update minimum pandas and numpy pinnings (#12887) @galipremsagar
Implement groupby.sample (#12882) @wence-
Update JNI build ENV default to gcc 11 (#12881) @pxLi
Change return type of cudf::structs::detail::flatten_nested_columns to smart pointer (#12878) @ttnghia
Fix passing seed parameter to MurmurHash3_32 in cudf::hash() function (#12875) @davidwendt
Remove manual artifact upload step in CI (#12869) @ajschmidt8
Update to GCC 11 (#12868) @bdice
Fix null hive-partition behavior in dask-cudf parquet (#12866) @rjzamora
Update to protobuf>=4.21.6,<4.22. (#12864) @bdice
Update RMM allocators (#12861) @pentschev
Improve performance for replace-multi for long strings (#12858) @davidwendt
Drop Python 3.7 handling for pickle protocol 4 (#12857) @jakirkham
Migrate as much as possible to pyproject.toml (#12850) @vyasr
Enable nbqa pre-commit hooks for isort and black. (#12848) @bdice
Setting a threshold for KvikIO IO (#12841) @madsbk
Update datasets download URL (#12840) @jjacobelli
Make docs builds less verbose (#12836) @AyodeAwe
Consolidate linter configs into pyproject.toml (#12834) @vyasr
Deprecate names & dtype in Index.copy (#12825) @galipremsagar
Deprecate inplace parameters in categorical methods (#12824) @galipremsagar
Add optional text file support to ninja-log utility (#12823) @davidwendt
Deprecate Index.is_* methods (#12820) @galipremsagar
Add dfg as a pre-commit hook (#12819) @vyasr
Deprecate datetime_is_numeric from describe (#12818) @galipremsagar
Deprecate na_sentinel in factorize (#12817) @galipremsagar
Shuffling read into a sub function in parquet read (#12809) @hyperbolic2346
Fixing parquet coalescing of reads (#12808) @hyperbolic2346
CI: Remove specification of manual stage for check_style.sh script. (#12803) @csadorf
Add compute-sanitizer github workflow action to nightly tests (#12800) @davidwendt
Enable groupby std and variance aggregation types in libcudf Debug build (#12799) @davidwendt
Expose seed argument to hash_values (#12795) @ayushdg
Fix groupby gtests coded in namespace cudf::test (#12784) @davidwendt
Improve performance for cudf::strings::count_characters for long strings (#12779) @davidwendt
Deallocate encoded data in ORC writer immediately after compression (#12770) @vuule
Stop force pulling fmt in nvbench. (#12768) @vyasr
Remove now redundant cuda initialization (#12758) @vyasr
Adds JSON reader, writer io benchmark (#12753) @karthikeyann
Use test paths relative to package directory. (#12751) @bdice
Add build metrics report as artifact to cpp-build workflow (#12750) @davidwendt
Add JNI methods for detecting and purging non-empty nulls from LIST and STRUCT (#12742) @razajafri
Stop using versioneer to manage versions (#12741) @vyasr
Reduce error handling verbosity in CI tests scripts (#12738) @AjayThorve
Reduce the number of test cases in multibyte_split benchmark (#12737) @PointKernel
Update shared workflow branches (#12733) @ajschmidt8
JNI switches to nested JSON reader (#12732) @res-life
Changing cudf::io::source_info to use cudf::host_span<std::byte> in a non-breaking form (#12730) @hyperbolic2346
Add nvbench environment class for initializing RMM in benchmarks (#12728) @davidwendt
Split C++ and Python build dependencies into separate lists. (#12724) @bdice
Add build dependencies to Java tests. (#12723) @bdice
Allow setting the seed argument for hash partition (#12715) @firestarman
Remove gpuCI scripts. (#12712) @bdice
Unpin dask and distributed for development (#12710) @galipremsagar
partition_by_hash(): use _split() (#12704) @madsbk
Remove DataFrame.quantiles from docs. (#12684) @bdice
Fast path for experimental::row::equality (#12676) @divyegala
Move date to build string in conda recipe (#12661) @ajschmidt8
Refactor reduction logic for fixed-point types (#12652) @davidwendt
Pay off some JNI RMM API tech debt (#12632) @revans2
Merge copy-on-write feature branch into branch-23.04 (#12619) @galipremsagar
Remove cudf::strings::repeat_strings_output_sizes and optional parameter from cudf::strings::repeat_strings (#12609) @davidwendt
Pin cuda-nvrtc. (#12606) @bdice
Remove cudf::test::print calls from libcudf gtests (#12604) @davidwendt
Init JNI version 23.04.0-SNAPSHOT (#12599) @pxLi
Add performance benchmarks to user facing docs (#12595) @galipremsagar
Add docs build job (#12592) @AyodeAwe
Replace message parsing with throwing more specific exceptions (#12426) @vyasr
Support conversion to/from cudf in dask.dataframe.core.to_backend (#12380) @rjzamora

- C++
Published by raydouglass over 2 years ago

https://github.com/rapidsai/cudf - [NIGHTLY] v23.10.00

🚨 Breaking Changes

Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
Create table_input_metadata from a table_metadata (#13920) @etseidl
Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
Remove the libcudf cudf::offset_type type (#13788) @davidwendt
Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
Update to Cython 3.0.0 (#13777) @vyasr
Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
Enforce deprecations in 23.10 (#13732) @galipremsagar
Upgrade to arrow 12 (#13728) @galipremsagar
Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule

🐛 Bug Fixes

Fix inaccurate ceil/floor and inaccurate rescaling casts of fixed-point values. (#14242) @bdice
Fix inaccuracy in decimal128 rounding. (#14233) @bdice
Workaround for illegal instruction error in sm90 for warp intrinsics with mask (#14201) @karthikeyann
Fix pytorch related pytest (#14198) @galipremsagar
Pin to aws-sdk-cpp<1.11 (#14173) @pentschev
Fix assert failure for range window functions (#14168) @mythrocks
Fix Memcheck error found in JSON_TEST JsonReaderTest.ErrorStrings (#14164) @karthikeyann
Fix calls to copy_bitmask to pass stream parameter (#14158) @davidwendt
Fix DataFrame from Series with different CategoricalIndexes (#14157) @mroeschke
Pin to numpy<1.25 and numba<0.58 to avoid errors and deprecation warnings-as-errors. (#14156) @bdice
Fix kernel launch error for cudf::io::orc::gpu::rowgroup_char_counts_kernel (#14139) @davidwendt
Don't sort columns for DataFrame init from list of Series (#14136) @mroeschke
Fix DataFrame.values with no columns but index (#14134) @mroeschke
Avoid circular cimports in _lib/cpp/reduce.pxd (#14125) @vyasr
Add support for nested dict in DataFrame constructor (#14119) @galipremsagar
Restrict iterables of DataFrame's as input to DataFrame constructor (#14118) @galipremsagar
Allow numeric_only=True for reduction operations on numeric types (#14111) @galipremsagar
Preserve name of the column while initializing a DataFrame (#14110) @galipremsagar
Correct numerous 20054-D: dynamic initialization errors found on arm+12.2 (#14108) @robertmaynard
Drop kwargs from Series.count (#14106) @galipremsagar
Fix naming issues with Index.to_frame and MultiIndex.to_frame APIs (#14105) @galipremsagar
Only use memory resources that haven't been freed (#14103) @robertmaynard
Add support for __round__ in Series and DataFrame (#14099) @galipremsagar
Validate ignore_index type in drop_duplicates (#14098) @mroeschke
Fix renaming Series and Index (#14080) @galipremsagar
Raise NotImplementedError in to_datetime if Z (or tz component) in string (#14074) @mroeschke
Raise NotImplementedError for datetime strings with UTC offset (#14070) @mroeschke
Update pyarrow-related dispatch logic in dask_cudf (#14069) @rjzamora
Use conda mambabuild rather than mamba mambabuild (#14067) @wence-
Raise NotImplementedError in to_datetime with dayfirst without infer_format (#14058) @mroeschke
Fix various issues in Index.intersection (#14054) @galipremsagar
Fix Index.difference to match with pandas (#14053) @galipremsagar
Fix empty string column construction (#14052) @galipremsagar
Fix IntervalIndex.union to preserve type-metadata (#14051) @galipremsagar
Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
Ignore compile_commands.json (#14048) @harrism
Raise TypeError for any non-parseable argument in to_datetime (#14044) @mroeschke
Raise NotImplementedError for to_datetime with z format (#14037) @mroeschke
Implement sort_remaining for sort_index (#14033) @wence-
Raise NotImplementedError for Categoricals with timezones (#14032) @mroeschke
Temporary fix Parquet metadata with empty value string being ignored from writing (#14026) @ttnghia
Preserve types of scalar being returned when possible in quantile (#14014) @galipremsagar
Fix return type of MultiIndex.difference (#14009) @galipremsagar
Raise an error when timezone subtypes are encountered in pd.IntervalDtype (#14006) @galipremsagar
Fix map column can not be non-nullable for java (#14003) @res-life
Fix name selection in Index.difference and Index.intersection (#13986) @galipremsagar
Restore column type metadata with dropna to fix factorize API (#13980) @galipremsagar
Use thread_index_type to avoid out of bounds accesses in conditional joins (#13971) @vyasr
Fix MultiIndex.to_numpy to return numpy array with tuples (#13966) @galipremsagar
Use cudf::thread_index_type in get_json_object and tdigest kernels (#13962) @nvdbaranec
Fix an issue with IntervalIndex.repr when null values are present (#13958) @galipremsagar
Fix type metadata issue preservation with Column.unique (#13957) @galipremsagar
Handle Interval scalars when passed in list-like inputs to cudf.Index (#13956) @galipremsagar
Fix setting of categories order when dtype is passed to a CategoricalColumn (#13955) @galipremsagar
Handle as_index in GroupBy.apply (#13951) @brandon-b-miller
Raise error for string types in nsmallest and nlargest (#13946) @galipremsagar
Fix index of Groupby.apply results when it is performed on empty objects (#13944) @galipremsagar
Fix integer overflow in shim device_sum functions (#13943) @brandon-b-miller
Fix type mismatch in groupby reduction for empty objects (#13942) @galipremsagar
Fixed processed bytes calculation in APPLY_BOOLEAN_MASK benchmark. (#13937) @Blonck
Fix construction of Grouping objects (#13932) @galipremsagar
Fix an issue with loc when column names is MultiIndex (#13929) @galipremsagar
Fix handling of typecasting in searchsorted (#13925) @galipremsagar
Preserve index name in reindex (#13917) @galipremsagar
Use cudf::thread_index_type in cuIO to prevent overflow in row indexing (#13910) @vuule
Fix for encodings listed in the Parquet column chunk metadata (#13907) @etseidl
Use cudf::thread_index_type in concatenate.cu. (#13906) @bdice
Use cudf::thread_index_type in replace.cu. (#13905) @bdice
Add noSanitizer tag to Java reduction tests failing with sanitizer in CUDA 12 (#13904) @jlowe
Remove the internal use of the cudf's default stream in cuIO (#13903) @vuule
Use cuda-nvtx-dev CUDA 12 package. (#13901) @bdice
Use thread_index_type to avoid index overflow in grid-stride loops (#13895) @PointKernel
Fix memory access error in cudf::shift for sliced strings (#13894) @davidwendt
Raise error when trying to construct a DataFrame with mixed types (#13889) @galipremsagar
Return nan when one variable to be correlated has zero variance in JIT GroupBy Apply (#13884) @brandon-b-miller
Correctly detect the BOM mark in read_csv with compressed input (#13881) @vuule
Check for the presence of all values in MultiIndex.isin (#13879) @galipremsagar
Fix nvtext::generate_character_ngrams performance regression for longer strings (#13874) @davidwendt
Fix return type of MultiIndex.levels (#13870) @galipremsagar
Fix List's missing children metadata in JSON writer (#13869) @karthikeyann
Disable construction of Index when freq is set in pandas-compatibility mode (#13857) @galipremsagar
Fix an issue with fetching NA from a TimedeltaColumn (#13853) @galipremsagar
Simplify implementation of interval_range() and fix behaviour for floating freq (#13844) @shwina
Fix binary operations between Series and Index (#13842) @galipremsagar
Update make_lists_column_from_scalar to use make_offsets_child_column utility (#13841) @davidwendt
Fix read out of bounds in string concatenate (#13838) @pentschev
Raise error for more cases when timezone-aware data is passed to as_column (#13835) @galipremsagar
Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
Raise error when trying to construct time-zone aware timestamps (#13830) @galipremsagar
Fix cuFile I/O factories (#13829) @vuule
DataFrame with namedtuples uses ._field as column names (#13824) @mroeschke
Branch 23.10 merge 23.08 (#13822) @vyasr
Return a Series from JIT GroupBy apply, rather than a DataFrame (#13820) @brandon-b-miller
No need to dlsym EnsureS3Finalized we can call it directly (#13819) @robertmaynard
Raise error when mixed types are being constructed (#13816) @galipremsagar
Fix unbounded sequence issue in DataFrame constructor (#13811) @galipremsagar
Fix Byte-Pair-Encoding usage of cuco static-map for storing merge-pairs (#13807) @davidwendt
Fix for Parquet writer when requested pages per row is smaller than fragment size (#13806) @etseidl
Remove hangs from trying to construct un-bounded sequences (#13799) @galipremsagar
Bug/update libcudf to handle arrow12 changes (#13794) @robertmaynard
Update get_arrow to Arrow 12's CMake target name of arrow::xsimd (#13790) @robertmaynard
Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
Fix negative unary operation for boolean type (#13780) @galipremsagar
Fix contains(in) method for Series (#13779) @galipremsagar
Fix binary operation column ordering and missing column issues (#13778) @galipremsagar
Cast only time of day to nanos to avoid an overflow in Parquet INT96 write (#13776) @gerashegalov
Preserve names of column object in various APIs (#13772) @galipremsagar
Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
Fix construction of DataFrames from dict when columns are provided (#13766) @wence-
Provide our own Cython declaration for make_unique (#13746) @wence-

📖 Documentation

Fix typo in docstring: metadata. (#14025) @bdice
Fix typo in parquet/page_decode.cuh (#13849) @XinyuZeng
Simplify Python doc configuration (#13826) @vyasr
Update documentation to reflect recent changes in JSON reader and writer (#13791) @vuule
Fix all warnings in Python docs (#13789) @vyasr

🚀 New Features

[Java] Add JNI bindings for integers_to_hex (#14205) @razajafri
Propagate errors from Parquet reader kernels back to host (#14167) @vuule
JNI for HISTOGRAM and MERGE_HISTOGRAM aggregations (#14154) @ttnghia
Expose streams in all public sorting APIs (#14146) @vyasr
Enable direct ingestion and production of Arrow scalars (#14121) @vyasr
Implement GroupBy.value_counts to match pandas API (#14114) @stmio
Refactor parquet thrift reader (#14097) @etseidl
Refactor hash_reduce_by_row (#14095) @ttnghia
Support negative preceding/following for ROW window functions (#14093) @mythrocks
Support for progressive parquet chunked reading. (#14079) @nvdbaranec
Implement HISTOGRAM and MERGE_HISTOGRAM aggregations (#14045) @ttnghia
Expose streams in public search APIs (#14034) @vyasr
Expose streams in public replace APIs (#14010) @vyasr
Add stream parameter to public cudf::strings::split APIs (#13997) @davidwendt
Expose streams in public filling APIs (#13990) @vyasr
Expose streams in public concatenate APIs (#13987) @vyasr
Use HostMemoryAllocator in jni::allocate_host_buffer (#13975) @gerashegalov
Enable fractional null probability for hashing benchmark (#13967) @Blonck
Switch pylibcudf-enabled types to use enum class in Cython (#13931) @vyasr
Add nvtext::tokenize_with_vocabulary API (#13930) @davidwendt
Rewrite DataFrame.stack to support multi level column names (#13927) @isVoid
Add HostMemoryAllocator interface (#13924) @gerashegalov
Global stream pool (#13922) @etseidl
Create table_input_metadata from a table_metadata (#13920) @etseidl
Translate column size overflow exception to JNI (#13911) @mythrocks
Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
Exclude some tests from running with the compute sanitizer (#13872) @firestarman
Expand statistics support in ORC writer (#13848) @vuule
Register the memory mapped buffer in datasource to improve H2D throughput (#13814) @vuule
Add cudf::strings::find function with target per row (#13808) @davidwendt
Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
Remove unnecessary pointer copying in JIT GroupBy Apply (#13792) @brandon-b-miller
Add 'poll' function to custreamz kafka consumer (#13782) @jdye64
Support corr in GroupBy.apply through the jit engine (#13767) @shwina
Optionally write version 2 page headers in Parquet writer (#13751) @etseidl
Support more numeric types in Groupby.apply with engine='jit' (#13729) @brandon-b-miller
[FEA] Add DELTA_BINARY_PACKED decoding support to Parquet reader (#13637) @etseidl
Read FIXED_LEN_BYTE_ARRAY as binary in parquet reader (#13437) @PointKernel

🛠️ Improvements

Update shared-action-workflows references (backport from 23.12 to 23.10) (#14300) @AyodeAwe
Pin dask and distributed for 23.10 release (#14225) @galipremsagar
update rmm tag path (#14195) @AyodeAwe
Disable Recently Updated Check (#14193) @ajschmidt8
Move cpp/src/hash/hash_allocator.cuh to include/cudf/hashing/detail (#14163) @davidwendt
Add Parquet reader benchmarks for row selection (#14147) @vuule
Update image names (#14145) @AyodeAwe
Support callables in DataFrame.assign (#14142) @wence-
Reduce memory usage of as_categorical_column (#14138) @wence-
Replace Python scalar conversions with libcudf (#14124) @vyasr
Update to clang 16.0.6. (#14120) @bdice
Fix type of empty Index and raise warning in Series constructor (#14116) @galipremsagar
Add stream parameter to external dict APIs (#14115) @SurajAralihalli
Add fallback matrix for nvcomp. (#14082) @bdice
[Java] Add recoverWithNull to JSONOptions and pass to Table.readJSON (#14078) @andygrove
Remove header tests (#14072) @ajschmidt8
Refactor contains_table with cuco::static_set (#14064) @PointKernel
Remove debug print in a Parquet test (#14063) @vuule
Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
Expose stream parameter in public strings find APIs (#14060) @davidwendt
Update doxygen to 1.9.1 (#14059) @vyasr
Remove the mr from the base fixture (#14057) @vyasr
Expose streams in public strings case APIs (#14056) @davidwendt
Refactor libcudf indexalator to typed normalator (#14043) @davidwendt
Use cudf::make_empty_column instead of column_view constructor (#14030) @davidwendt
Remove quadratic runtime due to accessing Frame._dtypes in loop (#14028) @wence-
Explicitly depend on zlib in conda recipes (#14018) @wence-
Use grid_stride for stride computations. (#13996) @bdice
Fix an issue where casting null-array to object dtype will result in a failure (#13994) @galipremsagar
Add tab as literal to cudf::test::to_string output (#13993) @davidwendt
Enable codes dtype parity in pandas-compatibility mode for factorize API (#13982) @galipremsagar
Fix CategoricalIndex ordering in Groupby.agg when pandas-compatibility mode is enabled (#13978) @galipremsagar
Produce a fatal error if cudf is unable to find pyarrow include directory (#13976) @cwharris
Use thread_index_type in partitioning.cu (#13973) @divyegala
Use cudf::thread_index_type in merge.cu (#13972) @divyegala
Use copy-pr-bot (#13970) @ajschmidt8
Use cudf::thread_index_type in strings custom kernels (#13968) @davidwendt
Add bytes_per_second to hash_partition benchmark (#13965) @Blonck
Added pinned pool reservation API for java (#13964) @revans2
Simplify wheel build scripts and allow alphas of RAPIDS dependencies (#13963) @vyasr
Add bytes_per_second to copy_if_else benchmark (#13960) @Blonck
Add pandas compatible output to Series.unique (#13959) @galipremsagar
Add bytes_per_second to compiled binaryop benchmark (#13938) @Blonck
Unpin dask and distributed for 23.10 development (#13935) @galipremsagar
Make HostColumnVector.getRefCount public (#13934) @abellina
Use cuco::static_set in JSON tree algorithm (#13928) @karthikeyann
Add java API to get size of host memory needed to copy column view (#13919) @revans2
Use cudf::size_type instead of int32 where appropriate in nvtext functions (#13915) @davidwendt
Enable hugepage for arrow host allocations (#13914) @madsbk
Improve performance of nvtext::edit_distance (#13912) @davidwendt
Ensure cudf internals use pylibcudf in pure Python mode (#13909) @vyasr
Use empty() instead of size() where possible (#13908) @vuule
[JNI] Adds HostColumnVector.EventHandler for spillability checks (#13898) @abellina
Return Timestamp & Timedelta for fetching scalars in DatetimeIndex & TimedeltaIndex (#13896) @galipremsagar
Allow explicit shuffle="p2p" within dask-cudf API (#13893) @rjzamora
Disable creation of DatetimeIndex when freq is passed to cudf.date_range (#13890) @galipremsagar
Bring parity with pandas for datetime & timedelta comparison operations (#13877) @galipremsagar
Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
Raise error when astype(object) is called in pandas compatibility mode (#13862) @galipremsagar
Fixes a performance regression in FST (#13850) @elstehle
Set native handles to null on close in Java wrapper classes (#13818) @jlowe
Avoid use of CUDF_EXPECTS in libcudf unit tests outside of helper functions with return values (#13812) @vuule
Update lists::contains to experimental row comparator (#13810) @divyegala
Reduce lists::contains dispatches for scalars (#13805) @divyegala
Long string optimization for string column parsing in JSON reader (#13803) @karthikeyann
Raise NotImplementedError for pd.SparseDtype (#13798) @mroeschke
Remove the libcudf cudf::offset_type type (#13788) @davidwendt
Move Spark-independent Table debug to cudf Java (#13783) @gerashegalov
Update to Cython 3.0.0 (#13777) @vyasr
Refactor Parquet reader handling of V2 page header info (#13775) @etseidl
Branch 23.10 merge 23.08 (#13773) @vyasr
Restructure JSON code to correctly reflect legacy/experimental status (#13757) @vuule
Branch 23.10 merge 23.08 (#13753) @vyasr
Enforce deprecations in 23.10 (#13732) @galipremsagar
Upgrade to arrow 12 (#13728) @galipremsagar
Refactors JSON reader's pushdown automaton (#13716) @elstehle
Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule

- C++
Published by rapids-bot[bot] over 2 years ago

https://github.com/rapidsai/cudf - v23.10.00

🚨 Breaking Changes

Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
Create table_input_metadata from a table_metadata (#13920) @etseidl
Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
Remove the libcudf cudf::offset_type type (#13788) @davidwendt
Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
Update to Cython 3.0.0 (#13777) @vyasr
Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
Enforce deprecations in 23.10 (#13732) @galipremsagar
Upgrade to arrow 12 (#13728) @galipremsagar
Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule

🐛 Bug Fixes

Fix inaccurate ceil/floor and inaccurate rescaling casts of fixed-point values. (#14242) @bdice
Fix inaccuracy in decimal128 rounding. (#14233) @bdice
Workaround for illegal instruction error in sm90 for warp intrinsics with mask (#14201) @karthikeyann
Fix pytorch related pytest (#14198) @galipremsagar
Pin to aws-sdk-cpp<1.11 (#14173) @pentschev
Fix assert failure for range window functions (#14168) @mythrocks
Fix Memcheck error found in JSON_TEST JsonReaderTest.ErrorStrings (#14164) @karthikeyann
Fix calls to copy_bitmask to pass stream parameter (#14158) @davidwendt
Fix DataFrame from Series with different CategoricalIndexes (#14157) @mroeschke
Pin to numpy<1.25 and numba<0.58 to avoid errors and deprecation warnings-as-errors. (#14156) @bdice
Fix kernel launch error for cudf::io::orc::gpu::rowgroup_char_counts_kernel (#14139) @davidwendt
Don't sort columns for DataFrame init from list of Series (#14136) @mroeschke
Fix DataFrame.values with no columns but index (#14134) @mroeschke
Avoid circular cimports in _lib/cpp/reduce.pxd (#14125) @vyasr
Add support for nested dict in DataFrame constructor (#14119) @galipremsagar
Restrict iterables of DataFrame's as input to DataFrame constructor (#14118) @galipremsagar
Allow numeric_only=True for reduction operations on numeric types (#14111) @galipremsagar
Preserve name of the column while initializing a DataFrame (#14110) @galipremsagar
Correct numerous 20054-D: dynamic initialization errors found on arm+12.2 (#14108) @robertmaynard
Drop kwargs from Series.count (#14106) @galipremsagar
Fix naming issues with Index.to_frame and MultiIndex.to_frame APIs (#14105) @galipremsagar
Only use memory resources that haven't been freed (#14103) @robertmaynard
Add support for __round__ in Series and DataFrame (#14099) @galipremsagar
Validate ignore_index type in drop_duplicates (#14098) @mroeschke
Fix renaming Series and Index (#14080) @galipremsagar
Raise NotImplementedError in to_datetime if Z (or tz component) in string (#14074) @mroeschke
Raise NotImplementedError for datetime strings with UTC offset (#14070) @mroeschke
Update pyarrow-related dispatch logic in dask_cudf (#14069) @rjzamora
Use conda mambabuild rather than mamba mambabuild (#14067) @wence-
Raise NotImplementedError in to_datetime with dayfirst without infer_format (#14058) @mroeschke
Fix various issues in Index.intersection (#14054) @galipremsagar
Fix Index.difference to match with pandas (#14053) @galipremsagar
Fix empty string column construction (#14052) @galipremsagar
Fix IntervalIndex.union to preserve type-metadata (#14051) @galipremsagar
Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
Ignore compile_commands.json (#14048) @harrism
Raise TypeError for any non-parseable argument in to_datetime (#14044) @mroeschke
Raise NotImplementedError for to_datetime with z format (#14037) @mroeschke
Implement sort_remaining for sort_index (#14033) @wence-
Raise NotImplementedError for Categoricals with timezones (#14032) @mroeschke
Temporary fix Parquet metadata with empty value string being ignored from writing (#14026) @ttnghia
Preserve types of scalar being returned when possible in quantile (#14014) @galipremsagar
Fix return type of MultiIndex.difference (#14009) @galipremsagar
Raise an error when timezone subtypes are encountered in pd.IntervalDtype (#14006) @galipremsagar
Fix map column can not be non-nullable for java (#14003) @res-life
Fix name selection in Index.difference and Index.intersection (#13986) @galipremsagar
Restore column type metadata with dropna to fix factorize API (#13980) @galipremsagar
Use thread_index_type to avoid out of bounds accesses in conditional joins (#13971) @vyasr
Fix MultiIndex.to_numpy to return numpy array with tuples (#13966) @galipremsagar
Use cudf::thread_index_type in get_json_object and tdigest kernels (#13962) @nvdbaranec
Fix an issue with IntervalIndex.repr when null values are present (#13958) @galipremsagar
Fix type metadata issue preservation with Column.unique (#13957) @galipremsagar
Handle Interval scalars when passed in list-like inputs to cudf.Index (#13956) @galipremsagar
Fix setting of categories order when dtype is passed to a CategoricalColumn (#13955) @galipremsagar
Handle as_index in GroupBy.apply (#13951) @brandon-b-miller
Raise error for string types in nsmallest and nlargest (#13946) @galipremsagar
Fix index of Groupby.apply results when it is performed on empty objects (#13944) @galipremsagar
Fix integer overflow in shim device_sum functions (#13943) @brandon-b-miller
Fix type mismatch in groupby reduction for empty objects (#13942) @galipremsagar
Fixed processed bytes calculation in APPLY_BOOLEAN_MASK benchmark. (#13937) @Blonck
Fix construction of Grouping objects (#13932) @galipremsagar
Fix an issue with loc when column names is MultiIndex (#13929) @galipremsagar
Fix handling of typecasting in searchsorted (#13925) @galipremsagar
Preserve index name in reindex (#13917) @galipremsagar
Use cudf::thread_index_type in cuIO to prevent overflow in row indexing (#13910) @vuule
Fix for encodings listed in the Parquet column chunk metadata (#13907) @etseidl
Use cudf::thread_index_type in concatenate.cu. (#13906) @bdice
Use cudf::thread_index_type in replace.cu. (#13905) @bdice
Add noSanitizer tag to Java reduction tests failing with sanitizer in CUDA 12 (#13904) @jlowe
Remove the internal use of the cudf's default stream in cuIO (#13903) @vuule
Use cuda-nvtx-dev CUDA 12 package. (#13901) @bdice
Use thread_index_type to avoid index overflow in grid-stride loops (#13895) @PointKernel
Fix memory access error in cudf::shift for sliced strings (#13894) @davidwendt
Raise error when trying to construct a DataFrame with mixed types (#13889) @galipremsagar
Return nan when one variable to be correlated has zero variance in JIT GroupBy Apply (#13884) @brandon-b-miller
Correctly detect the BOM mark in read_csv with compressed input (#13881) @vuule
Check for the presence of all values in MultiIndex.isin (#13879) @galipremsagar
Fix nvtext::generate_character_ngrams performance regression for longer strings (#13874) @davidwendt
Fix return type of MultiIndex.levels (#13870) @galipremsagar
Fix List's missing children metadata in JSON writer (#13869) @karthikeyann
Disable construction of Index when freq is set in pandas-compatibility mode (#13857) @galipremsagar
Fix an issue with fetching NA from a TimedeltaColumn (#13853) @galipremsagar
Simplify implementation of interval_range() and fix behaviour for floating freq (#13844) @shwina
Fix binary operations between Series and Index (#13842) @galipremsagar
Update make_lists_column_from_scalar to use make_offsets_child_column utility (#13841) @davidwendt
Fix read out of bounds in string concatenate (#13838) @pentschev
Raise error for more cases when timezone-aware data is passed to as_column (#13835) @galipremsagar
Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
Raise error when trying to construct time-zone aware timestamps (#13830) @galipremsagar
Fix cuFile I/O factories (#13829) @vuule
DataFrame with namedtuples uses ._field as column names (#13824) @mroeschke
Branch 23.10 merge 23.08 (#13822) @vyasr
Return a Series from JIT GroupBy apply, rather than a DataFrame (#13820) @brandon-b-miller
No need to dlsym EnsureS3Finalized we can call it directly (#13819) @robertmaynard
Raise error when mixed types are being constructed (#13816) @galipremsagar
Fix unbounded sequence issue in DataFrame constructor (#13811) @galipremsagar
Fix Byte-Pair-Encoding usage of cuco static-map for storing merge-pairs (#13807) @davidwendt
Fix for Parquet writer when requested pages per row is smaller than fragment size (#13806) @etseidl
Remove hangs from trying to construct un-bounded sequences (#13799) @galipremsagar
Bug/update libcudf to handle arrow12 changes (#13794) @robertmaynard
Update get_arrow to arrows 12 CMake target name of arrow::xsimd (#13790) @robertmaynard
Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
Fix negative unary operation for boolean type (#13780) @galipremsagar
Fix contains(in) method for Series (#13779) @galipremsagar
Fix binary operation column ordering and missing column issues (#13778) @galipremsagar
Cast only time of day to nanos to avoid an overflow in Parquet INT96 write (#13776) @gerashegalov
Preserve names of column object in various APIs (#13772) @galipremsagar
Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
Fix construction of DataFrames from dict when columns are provided (#13766) @wence-
Provide our own Cython declaration for make_unique (#13746) @wence-

📖 Documentation

Fix typo in docstring: metadata. (#14025) @bdice
Fix typo in parquet/page_decode.cuh (#13849) @XinyuZeng
Simplify Python doc configuration (#13826) @vyasr
Update documentation to reflect recent changes in JSON reader and writer (#13791) @vuule
Fix all warnings in Python docs (#13789) @vyasr

🚀 New Features

[Java] Add JNI bindings for integers_to_hex (#14205) @razajafri
Propagate errors from Parquet reader kernels back to host (#14167) @vuule
JNI for HISTOGRAM and MERGE_HISTOGRAM aggregations (#14154) @ttnghia
Expose streams in all public sorting APIs (#14146) @vyasr
Enable direct ingestion and production of Arrow scalars (#14121) @vyasr
Implement GroupBy.value_counts to match pandas API (#14114) @stmio
Refactor parquet thrift reader (#14097) @etseidl
Refactor hash_reduce_by_row (#14095) @ttnghia
Support negative preceding/following for ROW window functions (#14093) @mythrocks
Support for progressive parquet chunked reading. (#14079) @nvdbaranec
Implement HISTOGRAM and MERGE_HISTOGRAM aggregations (#14045) @ttnghia
Expose streams in public search APIs (#14034) @vyasr
Expose streams in public replace APIs (#14010) @vyasr
Add stream parameter to public cudf::strings::split APIs (#13997) @davidwendt
Expose streams in public filling APIs (#13990) @vyasr
Expose streams in public concatenate APIs (#13987) @vyasr
Use HostMemoryAllocator in jni::allocate_host_buffer (#13975) @gerashegalov
Enable fractional null probability for hashing benchmark (#13967) @Blonck
Switch pylibcudf-enabled types to use enum class in Cython (#13931) @vyasr
Add nvtext::tokenize_with_vocabulary API (#13930) @davidwendt
Rewrite DataFrame.stack to support multi level column names (#13927) @isVoid
Add HostMemoryAllocator interface (#13924) @gerashegalov
Global stream pool (#13922) @etseidl
Create table_input_metadata from a table_metadata (#13920) @etseidl
Translate column size overflow exception to JNI (#13911) @mythrocks
Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
Exclude some tests from running with the compute sanitizer (#13872) @firestarman
Expand statistics support in ORC writer (#13848) @vuule
Register the memory mapped buffer in datasource to improve H2D throughput (#13814) @vuule
Add cudf::strings::find function with target per row (#13808) @davidwendt
Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
Remove unnecessary pointer copying in JIT GroupBy Apply (#13792) @brandon-b-miller
Add 'poll' function to custreamz kafka consumer (#13782) @jdye64
Support corr in GroupBy.apply through the jit engine (#13767) @shwina
Optionally write version 2 page headers in Parquet writer (#13751) @etseidl
Support more numeric types in Groupby.apply with engine='jit' (#13729) @brandon-b-miller
[FEA] Add DELTA_BINARY_PACKED decoding support to Parquet reader (#13637) @etseidl
Read FIXED_LEN_BYTE_ARRAY as binary in parquet reader (#13437) @PointKernel

🛠️ Improvements

Pin dask and distributed for 23.10 release (#14225) @galipremsagar
update rmm tag path (#14195) @AyodeAwe
Disable Recently Updated Check (#14193) @ajschmidt8
Move cpp/src/hash/hash_allocator.cuh to include/cudf/hashing/detail (#14163) @davidwendt
Add Parquet reader benchmarks for row selection (#14147) @vuule
Update image names (#14145) @AyodeAwe
Support callables in DataFrame.assign (#14142) @wence-
Reduce memory usage of as_categorical_column (#14138) @wence-
Replace Python scalar conversions with libcudf (#14124) @vyasr
Update to clang 16.0.6. (#14120) @bdice
Fix type of empty Index and raise warning in Series constructor (#14116) @galipremsagar
Add stream parameter to external dict APIs (#14115) @SurajAralihalli
Add fallback matrix for nvcomp. (#14082) @bdice
[Java] Add recoverWithNull to JSONOptions and pass to Table.readJSON (#14078) @andygrove
Remove header tests (#14072) @ajschmidt8
Refactor contains_table with cuco::static_set (#14064) @PointKernel
Remove debug print in a Parquet test (#14063) @vuule
Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
Expose stream parameter in public strings find APIs (#14060) @davidwendt
Update doxygen to 1.9.1 (#14059) @vyasr
Remove the mr from the base fixture (#14057) @vyasr
Expose streams in public strings case APIs (#14056) @davidwendt
Refactor libcudf indexalator to typed normalator (#14043) @davidwendt
Use cudf::make_empty_column instead of column_view constructor (#14030) @davidwendt
Remove quadratic runtime due to accessing Frame._dtypes in loop (#14028) @wence-
Explicitly depend on zlib in conda recipes (#14018) @wence-
Use grid_stride for stride computations. (#13996) @bdice
Fix an issue where casting null-array to object dtype will result in a failure (#13994) @galipremsagar
Add tab as literal to cudf::test::to_string output (#13993) @davidwendt
Enable codes dtype parity in pandas-compatibility mode for factorize API (#13982) @galipremsagar
Fix CategoricalIndex ordering in Groupby.agg when pandas-compatibility mode is enabled (#13978) @galipremsagar
Produce a fatal error if cudf is unable to find pyarrow include directory (#13976) @cwharris
Use thread_index_type in partitioning.cu (#13973) @divyegala
Use cudf::thread_index_type in merge.cu (#13972) @divyegala
Use copy-pr-bot (#13970) @ajschmidt8
Use cudf::thread_index_type in strings custom kernels (#13968) @davidwendt
Add bytes_per_second to hash_partition benchmark (#13965) @Blonck
Added pinned pool reservation API for java (#13964) @revans2
Simplify wheel build scripts and allow alphas of RAPIDS dependencies (#13963) @vyasr
Add bytes_per_second to copy_if_else benchmark (#13960) @Blonck
Add pandas compatible output to Series.unique (#13959) @galipremsagar
Add bytes_per_second to compiled binaryop benchmark (#13938) @Blonck
Unpin dask and distributed for 23.10 development (#13935) @galipremsagar
Make HostColumnVector.getRefCount public (#13934) @abellina
Use cuco::static_set in JSON tree algorithm (#13928) @karthikeyann
Add java API to get size of host memory needed to copy column view (#13919) @revans2
Use cudf::size_type instead of int32 where appropriate in nvtext functions (#13915) @davidwendt
Enable hugepage for arrow host allocations (#13914) @madsbk
Improve performance of nvtext::edit_distance (#13912) @davidwendt
Ensure cudf internals use pylibcudf in pure Python mode (#13909) @vyasr
Use empty() instead of size() where possible (#13908) @vuule
[JNI] Adds HostColumnVector.EventHandler for spillability checks (#13898) @abellina
Return Timestamp & Timedelta for fetching scalars in DatetimeIndex & TimedeltaIndex (#13896) @galipremsagar
Allow explicit shuffle="p2p" within dask-cudf API (#13893) @rjzamora
Disable creation of DatetimeIndex when freq is passed to cudf.date_range (#13890) @galipremsagar
Bring parity with pandas for datetime & timedelta comparison operations (#13877) @galipremsagar
Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
Raise error when astype(object) is called in pandas compatibility mode (#13862) @galipremsagar
Fixes a performance regression in FST (#13850) @elstehle
Set native handles to null on close in Java wrapper classes (#13818) @jlowe
Avoid use of CUDF_EXPECTS in libcudf unit tests outside of helper functions with return values (#13812) @vuule
Update lists::contains to experimental row comparator (#13810) @divyegala
Reduce lists::contains dispatches for scalars (#13805) @divyegala
Long string optimization for string column parsing in JSON reader (#13803) @karthikeyann
Raise NotImplementedError for pd.SparseDtype (#13798) @mroeschke
Remove the libcudf cudf::offset_type type (#13788) @davidwendt
Move Spark-independent Table debug to cudf Java (#13783) @gerashegalov
Update to Cython 3.0.0 (#13777) @vyasr
Refactor Parquet reader handling of V2 page header info (#13775) @etseidl
Branch 23.10 merge 23.08 (#13773) @vyasr
Restructure JSON code to correctly reflect legacy/experimental status (#13757) @vuule
Branch 23.10 merge 23.08 (#13753) @vyasr
Enforce deprecations in 23.10 (#13732) @galipremsagar
Upgrade to arrow 12 (#13728) @galipremsagar
Refactors JSON reader's pushdown automaton (#13716) @elstehle
Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule

- C++
Published by raydouglass over 2 years ago

https://github.com/rapidsai/cudf - [NIGHTLY] v23.12.00

🚨 Breaking Changes

Expose stream parameter in public strings convert APIs (#14255) @davidwendt

🐛 Bug Fixes

Downgrade to Arrow 12.0.0 for aws-sdk-cpp and fix cudf_kafka builds for new CI containers (#14296) @vyasr
Fix memset error in nvtext::edit_distance_matrix (#14283) @davidwendt
Changes JSON reader's recovery option's behaviour to ignore all characters after a valid JSON record (#14279) @elstehle
Handle empty string correctly in Parquet statistics (#14257) @etseidl
Fixes behaviour for incomplete lines when recover_with_nulls is enabled (#14252) @elstehle
cudf::detail::pinned_allocator doesn't throw from deallocate (#14251) @robertmaynard
Fix strings replace for adjacent, identical multi-byte UTF-8 character targets (#14235) @davidwendt
Fixing parquet list of struct interpretation (#13715) @hyperbolic2346

🚀 New Features

Update rapids-cmake functions to non-deprecated signatures (#14265) @robertmaynard
Expose streams in public null mask APIs (#14263) @vyasr
Expose streams in binaryop APIs (#14187) @vyasr
Add pylibcudf.Scalar that interoperates with Arrow scalars (#14133) @vyasr
Add DELTA_BINARY_PACKED encoder for Parquet writer (#14100) @etseidl

🛠️ Improvements

Forward-merge branch-23.10 to branch-23.12 (#14307) @bdice
Update shared-action-workflows references (#14289) @AyodeAwe
Add stream parameter to List Sort and Filter APIs (#14272) @SurajAralihalli
Use branch-23.12 workflows. (#14271) @bdice
Refactor LogicalType for Parquet (#14264) @etseidl
Centralize chunked reading code in the parquet reader to reader_impl_chunking.cu (#14262) @nvdbaranec
Expose stream parameter in public strings replace APIs (#14261) @davidwendt
Expose stream parameter in public strings APIs (#14260) @davidwendt
Cleanup of namespaces in parquet code. (#14259) @nvdbaranec
Make parquet schema index type consistent (#14256) @hyperbolic2346
Expose stream parameter in public strings convert APIs (#14255) @davidwendt
Add in java bindings for DataSource (#14254) @revans2
Add stream parameter to List Manipulation and Operations APIs (#14248) @SurajAralihalli
Expose stream parameter in public strings split/partition APIs (#14247) @davidwendt
Improve contains_column by invoking contains_table (#14238) @PointKernel
Detect and report errors in Parquet header parsing (#14237) @etseidl
Forward merge 23.10 into 23.12 (#14231) @galipremsagar
Return error if BOOL8 column-type is used with integers-to-hex (#14208) @davidwendt
Enable indexalator for device code (#14206) @davidwendt
Marginally reduce memory footprint of joins (#14197) @wence-
Add nvtx annotations to spilling-based data movement (#14196) @wence-
Remove the use of volatile in ORC (#14175) @vuule
Add bytes_per_second to distinct_count of stream_compaction nvbench. (#14172) @Blonck
Add bytes_per_second to transpose benchmark (#14170) @Blonck
cuDF: Build CUDA 12.0 ARM conda packages. (#14112) @bdice
Add bytes_per_second to shift benchmark (#13950) @Blonck

- C++
Published by rapids-bot[bot] over 2 years ago

https://github.com/rapidsai/cudf - v23.08.00

🚨 Breaking Changes

Enforce deprecations and add clarifications around existing deprecations (#13710) @galipremsagar
Separate MurmurHash32 from hash_functions.cuh (#13681) @davidwendt
Avoid storing metadata in pointers in ORC and Parquet writers (#13648) @vuule
Expose streams in all public copying APIs (#13629) @vyasr
Remove deprecated cudf::strings::slice_strings (by delimiter) functions (#13628) @davidwendt
Remove deprecated cudf.set_allocator. (#13591) @bdice
Change build.sh to use pip install instead of setup.py (#13507) @vyasr
Remove unused max_rows_tensor parameter from subword tokenizer (#13463) @davidwendt
Fix decimal scale reductions in _get_decimal_type (#13224) @charlesbluca

🐛 Bug Fixes

Add CUDA version to cudf_kafka and libcudf-example build strings. (#13769) @bdice
Fix typo in wheels-test.yaml. (#13763) @bdice
Don't test strings shorter than the requested ngram size (#13758) @vyasr
Add CUDA version to custreamz build string. (#13754) @bdice
Fix writing of ORC files with empty child string columns (#13745) @vuule
Remove the erroneous "empty level" short-circuit from ORC reader (#13722) @vuule
Fix character counting when writing sliced tables into ORC (#13721) @vuule
Parquet uses row group row count if missing from header (#13712) @hyperbolic2346
Fix reading of RLE encoded boolean data from parquet files with V2 page headers (#13707) @etseidl
Fix a corner case of list lexicographic comparator (#13701) @ttnghia
Fix combined filtering and column projection in dask_cudf.read_parquet (#13697) @rjzamora
Revert fetch-rapids changes (#13696) @vyasr
Data generator - include offsets in the size estimate of list elements (#13688) @vuule
Add cuda-nvcc-impl to cudf for numba CUDA 12 (#13673) @jakirkham
Fix combined filtering and column projection in read_parquet (#13666) @rjzamora
Use thrust::identity as hash functions for byte pair encoding (#13665) @PointKernel
Fix loc-getitem ordering when index contains duplicate labels (#13659) @wence-
[REVIEW] Introduce parity with pandas for MultiIndex.loc ordering & fix a bug in Groupby with as_index (#13657) @galipremsagar
Fix memcheck error found in nvtext tokenize functions (#13649) @davidwendt
Fix has_nonempty_nulls ignoring column offset (#13647) @ttnghia
[Java] Avoid double-free corruption in case of an Exception while creating a ColumnView (#13645) @razajafri
Fix memcheck error in ORC reader call to cudf::io::copy_uncompressed_kernel (#13643) @davidwendt
Fix CUDA 12 conda environment to remove cubinlinker and ptxcompiler. (#13636) @bdice
Fix inf/NaN comparisons for FLOAT orderby in window functions (#13635) @mythrocks
Refactor Index search to simplify code and increase correctness (#13625) @wence-
Fix compile warning for unused variable in split_re.cu (#13621) @davidwendt
Fix tz_localize for dask_cudf Series (#13610) @shwina
Fix issue with no decompressed data in ORC reader (#13609) @vuule
Fix floating point window range extents. (#13606) @mythrocks
Fix localize(None) for timezone-naive columns (#13603) @shwina
Fixed a memory leak caused by Exception thrown while constructing a ColumnView (#13597) @razajafri
Handle nullptr return value from bitmask_or in distinct_count (#13590) @wence-
Bring parity with pandas in Index.join (#13589) @galipremsagar
Fix cudf.melt when there are more than 255 columns (#13588) @hcho3
Fix memory issues in cuIO due to removal of memory padding (#13586) @ttnghia
Fix Parquet multi-file reading (#13584) @etseidl
Fix memcheck error found in LISTS_TEST (#13579) @davidwendt
Fix memcheck error found in STRINGS_TEST (#13578) @davidwendt
Fix memcheck error found in INTEROP_TEST (#13577) @davidwendt
Fix memcheck errors found in REDUCTION_TEST (#13574) @davidwendt
Preemptive fix for hive-partitioning change in dask (#13564) @rjzamora
Fix an issue with dask_cudf.read_csv when lines are needed to be skipped (#13555) @galipremsagar
Fix out-of-bounds memory write in cudf::dictionary::detail::concatenate (#13554) @davidwendt
Fix the null mask size in json reader (#13537) @karthikeyann
Fix cudf::strings::strip for all-empty input column (#13533) @davidwendt
Make sure to build without isolation or installing dependencies (#13524) @vyasr
Remove preload lib from CMake for now (#13519) @vyasr
Fix missing separator after null values in JSON writer (#13503) @karthikeyann
Ensure single_lane_block_sum_reduce is safe to call in a loop (#13488) @wence-
Update all versions in pyproject.toml files. (#13486) @bdice
Remove applying nvbench that doesn't exist in 23.08 (#13484) @robertmaynard
Fix chunked Parquet reader benchmark (#13482) @vuule
Update JNI JSON reader column compatibility for Spark (#13477) @revans2
Fix unsanitized output of scan with strings (#13455) @davidwendt
Reject functions without bytecode from _can_be_jitted in GroupBy Apply (#13429) @brandon-b-miller
Fix decimal scale reductions in _get_decimal_type (#13224) @charlesbluca

📖 Documentation

Fix doxygen groups for io data sources and sinks (#13718) @davidwendt
Add pandas compatibility note to DataFrame.query docstring (#13693) @beckernick
Add pylibcudf to developer guide (#13639) @vyasr
Fix repeated words in doxygen text (#13598) @karthikeyann
Update docs for top-level API. (#13592) @bdice
Fix the the doxygen text for cudf::concatenate and other places (#13561) @davidwendt
Document stream validation approach used in testing (#13556) @vyasr
Cleanup doc repetitions in libcudf (#13470) @karthikeyann

🚀 New Features

Support min and max aggregations for list type in groupby and reduction (#13676) @ttnghia
Add nvtext::jaccard_index API for strings columns (#13669) @davidwendt
Add read_parquet_metadata libcudf API (#13663) @karthikeyann
Expose streams in all public copying APIs (#13629) @vyasr
Add XXHash_64 hash function to cudf (#13612) @davidwendt
Java support: Floating point order-by columns for RANGE window functions (#13595) @mythrocks
Use cuco::static_map to build string dictionaries in ORC writer (#13580) @vuule
Add pylibcudf subpackage with gather implementation (#13562) @vyasr
Add JNI for lists::concatenate_list_elements (#13547) @ttnghia
Enable nested types for lists::concatenate_list_elements (#13545) @ttnghia
Add unicode encoding for string columns in JSON writer (#13539) @karthikeyann
Remove numba kernels from find_index_of_val (#13517) @brandon-b-miller
Floating point order-by columns for RANGE window functions (#13512) @mythrocks
Parse column chunk metadata statistics in parquet reader (#13472) @karthikeyann
Add abs function to apply (#13408) @brandon-b-miller
[FEA] AST filtering in parquet reader (#13348) @karthikeyann
[FEA] Adds option to recover from invalid JSON lines in JSON tokenizer (#13344) @elstehle
Ensure cccl packages don't clash with upstream version (#13235) @robertmaynard
Update struct_minmax_util to experimental row comparator (#13069) @divyegala
Add stream parameter to hashing APIs (#12090) @vyasr

🛠️ Improvements

Pin dask and distributed for 23.08 release (#13802) @galipremsagar
Relax protobuf pinnings. (#13770) @bdice
Switch fully unbounded window functions to use aggregations (#13727) @mythrocks
Switch to new wheel building pipeline (#13723) @vyasr
Revert CUDA 12.0 CI workflows to branch-23.08. (#13719) @bdice
Adding identify minimum version requirement (#13713) @hyperbolic2346
Enforce deprecations and add clarifications around existing deprecations (#13710) @galipremsagar
Optimize ORC reader performance for list data (#13708) @vyasr
fix limit overflow message in a docstring (#13703) @ahmet-uyar
Alleviates JSON parser's need for multi-file sources to end with a newline (#13702) @elstehle
Update cython-lint and replace flake8 with ruff (#13699) @vyasr
Add __dask_tokenize__ definitions to cudf classes (#13695) @rjzamora
Convert libcudf hashing benchmarks to nvbench (#13694) @davidwendt
Separate MurmurHash32 from hash_functions.cuh (#13681) @davidwendt
Improve performance of cudf::strings::split on whitespace (#13680) @davidwendt
Allow ORC and Parquet writers to write nullable columns without nulls as non-nullable (#13675) @vuule
Raise a NotImplementedError in to_datetime when utc is passed (#13670) @shwina
Add rmm_mode parameter to nvbench base fixture (#13668) @davidwendt
Fix multiindex loc ordering in pandas-compat mode (#13660) @wence-
Add nvtext hash_character_ngrams function (#13654) @davidwendt
Avoid storing metadata in pointers in ORC and Parquet writers (#13648) @vuule
Acquire spill lock in to/from_arrow (#13646) @shwina
Expose stable versions of libcudf sort routines (#13634) @wence-
Separate out hash_test.cpp source for each hash API (#13633) @davidwendt
Remove deprecated cudf::strings::slice_strings (by delimiter) functions (#13628) @davidwendt
Create separate libcudf hash APIs for each supported hash function (#13626) @davidwendt
Add convert_dtypes API (#13623) @shwina
Clean up cupy in dependencies.yaml. (#13617) @bdice
Use cuda-version to constrain cudatoolkit. (#13615) @bdice
Add murmurhash3_x64_128 function to libcudf (#13604) @davidwendt
Performance improvement for cudf::strings::like (#13594) @davidwendt
Remove deprecated cudf.set_allocator. (#13591) @bdice
Clean up cudf device atomic with cuda::atomic_ref (#13583) @PointKernel
Add java bindings for distinct count (#13573) @revans2
Use nvcomp conda package. (#13566) @bdice
Add exception to string_scalar if input string exceeds size_type (#13560) @davidwendt
Add dispatch for cudf.Dataframe to/from pyarrow.Table conversion (#13558) @rjzamora
Get rid of cuco::pair_type aliases (#13553) @PointKernel
Introduce parity with pandas when sort=False in Groupby (#13551) @galipremsagar
Update CMake in docker to 3.26.4 (#13550) @NvTimLiu
Clarify source of error message in stream testing. (#13541) @bdice
Deprecate strings_to_categorical in cudf.read_parquet (#13540) @galipremsagar
Update to CMake 3.26.4 (#13538) @vyasr
s3 folder naming fix (#13536) @AyodeAwe
Implement iloc-getitem using parse-don't-validate approach (#13534) @wence-
Make synchronization explicit in the names of hostdevice_* copying APIs (#13530) @ttnghia
Add benchmark (Google Benchmark) dependency to conda packages. (#13528) @bdice
Add libcufile to dependencies.yaml. (#13523) @bdice
Fix some memoization logic in groupby/sort/sort_helper.cu (#13521) @davidwendt
Use sizes_to_offsets_iterator in cudf::gather for strings (#13520) @davidwendt
use rapids-upload-docs script (#13518) @AyodeAwe
Support UTF-8 BOM in CSV reader (#13516) @davidwendt
Move stream-related test configuration to CMake (#13513) @vyasr
Implement cudf.option_context (#13511) @galipremsagar
Unpin dask and distributed for development (#13508) @galipremsagar
Change build.sh to use pip install instead of setup.py (#13507) @vyasr
Use test default stream (#13506) @vyasr
Remove documentation build scripts for Jenkins (#13495) @ajschmidt8
Use east const in include files (#13494) @karthikeyann
Use east const in src files (#13493) @karthikeyann
Use east const in tests files (#13492) @karthikeyann
Use east const in benchmarks files (#13491) @karthikeyann
Performance improvement for nvtext tokenize/token functions (#13480) @davidwendt
Add pd.Float*Dtype to Avro and ORC mappings (#13475) @mroeschke
Use pandas public APIs where available (#13467) @mroeschke
Allow pd.ArrowDtype in cudf.from_pandas (#13465) @mroeschke
Rework libcudf regex benchmarks with nvbench (#13464) @davidwendt
Remove unused max_rows_tensor parameter from subword tokenizer (#13463) @davidwendt
Separate io-text and nvtext pytests into different files (#13435) @davidwendt
Add a move_to function to cudf::string_view::const_iterator (#13428) @davidwendt
Allow newer scikit-build (#13424) @vyasr
Refactor sort_by_values to sort_values, drop indices from return values. (#13419) @bdice
Inline Cython exception handler (#13411) @vyasr
Init JNI version 23.08.0-SNAPSHOT (#13401) @pxLi
Refactor ORC reader (#13396) @ttnghia
JNI: Remove cleaned objects in memory cleaner (#13378) @res-life
Add tests of currently unsupported indexing (#13338) @wence-
Performance improvement for some libcudf regex functions for long strings (#13322) @davidwendt
Exposure Tracked Buffer (first step towards unifying copy-on-write and spilling) (#13307) @madsbk
Write string data directly to column_buffer in Parquet reader (#13302) @etseidl
Add stacktrace into cudf exception types (#13298) @ttnghia
cuDF: Build CUDA 12 packages (#12922) @bdice

- C++
Published by raydouglass almost 3 years ago

https://github.com/rapidsai/cudf - [NIGHTLY] v23.10.00

🔗 Links

🚨 Breaking Changes

Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
Create table_input_metadata from a table_metadata (#13920) @etseidl
Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
Remove the libcudf cudf::offset_type type (#13788) @davidwendt
Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
Update to Cython 3.0.0 (#13777) @vyasr
Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
Enforce deprecations in 23.10 (#13732) @galipremsagar
Upgrade to arrow 12 (#13728) @galipremsagar
Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule

🐛 Bug Fixes

Fix kernel launch error for cudf::io::orc::gpu::rowgroup_char_counts_kernel (#14139) @davidwendt
Don't sort columns for DataFrame init from list of Series (#14136) @mroeschke
Fix DataFrame.values with no columns but index (#14134) @mroeschke
Avoid circular cimports in _lib/cpp/reduce.pxd (#14125) @vyasr
Add support for nested dict in DataFrame constructor (#14119) @galipremsagar
Restrict iterables of DataFrame's as input to DataFrame constructor (#14118) @galipremsagar
Allow numeric_only=True for reduction operations on numeric types (#14111) @galipremsagar
Drop kwargs from Series.count (#14106) @galipremsagar
Fix naming issues with Index.to_frame and MultiIndex.to_frame APIs (#14105) @galipremsagar
Only use memory resources that haven't been freed (#14103) @robertmaynard
Add support for __round__ in Series and DataFrame (#14099) @galipremsagar
Validate ignore_index type in drop_duplicates (#14098) @mroeschke
Fix renaming Series and Index (#14080) @galipremsagar
Raise NotImplementedError in to_datetime if Z (or tz component) in string (#14074) @mroeschke
Raise NotImplementedError for datetime strings with UTC offset (#14070) @mroeschke
Update pyarrow-related dispatch logic in dask_cudf (#14069) @rjzamora
Use conda mambabuild rather than mamba mambabuild (#14067) @wence-
Raise NotImplementedError in to_datetime with dayfirst without infer_format (#14058) @mroeschke
Fix various issues in Index.intersection (#14054) @galipremsagar
Fix Index.difference to match with pandas (#14053) @galipremsagar
Fix empty string column construction (#14052) @galipremsagar
Fix IntervalIndex.union to preserve type-metadata (#14051) @galipremsagar
Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
Ignore compile_commands.json (#14048) @harrism
Raise TypeError for any non-parseable argument in to_datetime (#14044) @mroeschke
Raise NotImplementedError for to_datetime with z format (#14037) @mroeschke
Implement sort_remaining for sort_index (#14033) @wence-
Raise NotImplementedError for Categoricals with timezones (#14032) @mroeschke
Temporary fix Parquet metadata with empty value string being ignored from writing (#14026) @ttnghia
Preserve types of scalar being returned when possible in quantile (#14014) @galipremsagar
Fix return type of MultiIndex.difference (#14009) @galipremsagar
Raise an error when timezone subtypes are encountered in pd.IntervalDtype (#14006) @galipremsagar
Fix map column can not be non-nullable for java (#14003) @res-life
Fix name selection in Index.difference and Index.intersection (#13986) @galipremsagar
Restore column type metadata with dropna to fix factorize API (#13980) @galipremsagar
Use thread_index_type to avoid out of bounds accesses in conditional joins (#13971) @vyasr
Fix MultiIndex.to_numpy to return numpy array with tuples (#13966) @galipremsagar
Use cudf::thread_index_type in get_json_object and tdigest kernels (#13962) @nvdbaranec
Fix an issue with IntervalIndex.repr when null values are present (#13958) @galipremsagar
Fix type metadata issue preservation with Column.unique (#13957) @galipremsagar
Handle Interval scalars when passed in list-like inputs to cudf.Index (#13956) @galipremsagar
Fix setting of categories order when dtype is passed to a CategoricalColumn (#13955) @galipremsagar
Handle as_index in GroupBy.apply (#13951) @brandon-b-miller
Raise error for string types in nsmallest and nlargest (#13946) @galipremsagar
Fix index of Groupby.apply results when it is performed on empty objects (#13944) @galipremsagar
Fix integer overflow in shim device_sum functions (#13943) @brandon-b-miller
Fix type mismatch in groupby reduction for empty objects (#13942) @galipremsagar
Fixed processed bytes calculation in APPLY_BOOLEAN_MASK benchmark. (#13937) @Blonck
Fix construction of Grouping objects (#13932) @galipremsagar
Fix an issue with loc when column names is MultiIndex (#13929) @galipremsagar
Fix handling of typecasting in searchsorted (#13925) @galipremsagar
Preserve index name in reindex (#13917) @galipremsagar
Use cudf::thread_index_type in cuIO to prevent overflow in row indexing (#13910) @vuule
Fix for encodings listed in the Parquet column chunk metadata (#13907) @etseidl
Use cudf::thread_index_type in concatenate.cu. (#13906) @bdice
Use cudf::thread_index_type in replace.cu. (#13905) @bdice
Add noSanitizer tag to Java reduction tests failing with sanitizer in CUDA 12 (#13904) @jlowe
Remove the internal use of the cudf's default stream in cuIO (#13903) @vuule
Use cuda-nvtx-dev CUDA 12 package. (#13901) @bdice
Use thread_index_type to avoid index overflow in grid-stride loops (#13895) @PointKernel
Fix memory access error in cudf::shift for sliced strings (#13894) @davidwendt
Raise error when trying to construct a DataFrame with mixed types (#13889) @galipremsagar
Return nan when one variable to be correlated has zero variance in JIT GroupBy Apply (#13884) @brandon-b-miller
Correctly detect the BOM mark in read_csv with compressed input (#13881) @vuule
Check for the presence of all values in MultiIndex.isin (#13879) @galipremsagar
Fix nvtext::generate_character_ngrams performance regression for longer strings (#13874) @davidwendt
Fix return type of MultiIndex.levels (#13870) @galipremsagar
Fix List's missing children metadata in JSON writer (#13869) @karthikeyann
Disable construction of Index when freq is set in pandas-compatibility mode (#13857) @galipremsagar
Fix an issue with fetching NA from a TimedeltaColumn (#13853) @galipremsagar
Simplify implementation of interval_range() and fix behaviour for floating freq (#13844) @shwina
Fix binary operations between Series and Index (#13842) @galipremsagar
Update make_lists_column_from_scalar to use make_offsets_child_column utility (#13841) @davidwendt
Fix read out of bounds in string concatenate (#13838) @pentschev
Raise error for more cases when timezone-aware data is passed to as_column (#13835) @galipremsagar
Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
Raise error when trying to construct time-zone aware timestamps (#13830) @galipremsagar
Fix cuFile I/O factories (#13829) @vuule
DataFrame with namedtuples uses ._field as column names (#13824) @mroeschke
Branch 23.10 merge 23.08 (#13822) @vyasr
Return a Series from JIT GroupBy apply, rather than a DataFrame (#13820) @brandon-b-miller
No need to dlsym EnsureS3Finalized we can call it directly (#13819) @robertmaynard
Raise error when mixed types are being constructed (#13816) @galipremsagar
Fix unbounded sequence issue in DataFrame constructor (#13811) @galipremsagar
Fix Byte-Pair-Encoding usage of cuco static-map for storing merge-pairs (#13807) @davidwendt
Fix for Parquet writer when requested pages per row is smaller than fragment size (#13806) @etseidl
Remove hangs from trying to construct un-bounded sequences (#13799) @galipremsagar
Bug/update libcudf to handle arrow12 changes (#13794) @robertmaynard
Update get_arrow to arrows 12 CMake target name of arrow::xsimd (#13790) @robertmaynard
Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
Fix negative unary operation for boolean type (#13780) @galipremsagar
Fix contains(in) method for Series (#13779) @galipremsagar
Fix binary operation column ordering and missing column issues (#13778) @galipremsagar
Cast only time of day to nanos to avoid an overflow in Parquet INT96 write (#13776) @gerashegalov
Preserve names of column object in various APIs (#13772) @galipremsagar
Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
Fix construction of DataFrames from dict when columns are provided (#13766) @wence-
Provide our own Cython declaration for make_unique (#13746) @wence-

📖 Documentation

Fix typo in docstring: metadata. (#14025) @bdice
Fix typo in parquet/page_decode.cuh (#13849) @XinyuZeng
Simplify Python doc configuration (#13826) @vyasr
Update documentation to reflect recent changes in JSON reader and writer (#13791) @vuule
Fix all warnings in Python docs (#13789) @vyasr

🚀 New Features

Implement GroupBy.value_counts to match pandas API (#14114) @stmio
Refactor parquet thrift reader (#14097) @etseidl
Refactor hash_reduce_by_row (#14095) @ttnghia
Support negative preceding/following for ROW window functions (#14093) @mythrocks
Expose streams in public search APIs (#14034) @vyasr
Expose streams in public replace APIs (#14010) @vyasr
Add stream parameter to public cudf::strings::split APIs (#13997) @davidwendt
Expose streams in public filling APIs (#13990) @vyasr
Expose streams in public concatenate APIs (#13987) @vyasr
Use HostMemoryAllocator in jni::allocate_host_buffer (#13975) @gerashegalov
Enable fractional null probability for hashing benchmark (#13967) @Blonck
Switch pylibcudf-enabled types to use enum class in Cython (#13931) @vyasr
Rewrite DataFrame.stack to support multi level column names (#13927) @isVoid
Add HostMemoryAllocator interface (#13924) @gerashegalov
Global stream pool (#13922) @etseidl
Create table_input_metadata from a table_metadata (#13920) @etseidl
Translate column size overflow exception to JNI (#13911) @mythrocks
Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
Exclude some tests from running with the compute sanitizer (#13872) @firestarman
Expand statistics support in ORC writer (#13848) @vuule
Register the memory mapped buffer in datasource to improve H2D throughput (#13814) @vuule
Add cudf::strings::find function with target per row (#13808) @davidwendt
Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
Remove unnecessary pointer copying in JIT GroupBy Apply (#13792) @brandon-b-miller
Add 'poll' function to custreamz kafka consumer (#13782) @jdye64
Support corr in GroupBy.apply through the jit engine (#13767) @shwina
Optionally write version 2 page headers in Parquet writer (#13751) @etseidl
Support more numeric types in Groupby.apply with engine='jit' (#13729) @brandon-b-miller
[FEA] Add DELTA_BINARY_PACKED decoding support to Parquet reader (#13637) @etseidl
Read FIXED_LEN_BYTE_ARRAY as binary in parquet reader (#13437) @PointKernel

🛠️ Improvements

Reduce memory usage of as_categorical_column (#14138) @wence-
Update to clang 16.0.6. (#14120) @bdice
Fix type of empty Index and raise warning in Series constructor (#14116) @galipremsagar
Add fallback matrix for nvcomp. (#14082) @bdice
[Java] Add recoverWithNull to JSONOptions and pass to Table.readJSON (#14078) @andygrove
Remove header tests (#14072) @ajschmidt8
Remove debug print in a Parquet test (#14063) @vuule
Expose stream parameter in public strings find APIs (#14060) @davidwendt
Update doxygen to 1.9.1 (#14059) @vyasr
Remove the mr from the base fixture (#14057) @vyasr
Expose streams in public strings case APIs (#14056) @davidwendt
Use cudf::make_empty_column instead of column_view constructor (#14030) @davidwendt
Remove quadratic runtime due to accessing Frame._dtypes in loop (#14028) @wence-
Explicitly depend on zlib in conda recipes (#14018) @wence-
Use grid_stride for stride computations. (#13996) @bdice
Fix an issue where casting null-array to object dtype will result in a failure (#13994) @galipremsagar
Add tab as literal to cudf::test::to_string output (#13993) @davidwendt
Enable codes dtype parity in pandas-compatibility mode for factorize API (#13982) @galipremsagar
Fix CategoricalIndex ordering in Groupby.agg when pandas-compatibility mode is enabled (#13978) @galipremsagar
Produce a fatal error if cudf is unable to find pyarrow include directory (#13976) @cwharris
Use thread_index_type in partitioning.cu (#13973) @divyegala
Use cudf::thread_index_type in merge.cu (#13972) @divyegala
Use copy-pr-bot (#13970) @ajschmidt8
Use cudf::thread_index_type in strings custom kernels (#13968) @davidwendt
Add bytes_per_second to hash_partition benchmark (#13965) @Blonck
Added pinned pool reservation API for java (#13964) @revans2
Simplify wheel build scripts and allow alphas of RAPIDS dependencies (#13963) @vyasr
Add bytes_per_second to copy_if_else benchmark (#13960) @Blonck
Add pandas compatible output to Series.unique (#13959) @galipremsagar
Add bytes_per_second to compiled binaryop benchmark (#13938) @Blonck
Unpin dask and distributed for 23.10 development (#13935) @galipremsagar
Make HostColumnVector.getRefCount public (#13934) @abellina
Use cuco::static_set in JSON tree algorithm (#13928) @karthikeyann
Add java API to get size of host memory needed to copy column view (#13919) @revans2
Use cudf::size_type instead of int32 where appropriate in nvtext functions (#13915) @davidwendt
Enable hugepage for arrow host allocations (#13914) @madsbk
Improve performance of nvtext::edit_distance (#13912) @davidwendt
Ensure cudf internals use pylibcudf in pure Python mode (#13909) @vyasr
Use empty() instead of size() where possible (#13908) @vuule
[JNI] Adds HostColumnVector.EventHandler for spillability checks (#13898) @abellina
Return Timestamp & Timedelta for fetching scalars in DatetimeIndex & TimedeltaIndex (#13896) @galipremsagar
Disable creation of DatetimeIndex when freq is passed to cudf.date_range (#13890) @galipremsagar
Bring parity with pandas for datetime & timedelta comparison operations (#13877) @galipremsagar
Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
Raise error when astype(object) is called in pandas compatibility mode (#13862) @galipremsagar
Fixes a performance regression in FST (#13850) @elstehle
Set native handles to null on close in Java wrapper classes (#13818) @jlowe
Avoid use of CUDF_EXPECTS in libcudf unit tests outside of helper functions with return values (#13812) @vuule
Update lists::contains to experimental row comparator (#13810) @divyegala
Reduce lists::contains dispatches for scalars (#13805) @divyegala
Long string optimization for string column parsing in JSON reader (#13803) @karthikeyann
Raise NotImplementedError for pd.SparseDtype (#13798) @mroeschke
Remove the libcudf cudf::offset_type type (#13788) @davidwendt
Move Spark-independent Table debug to cudf Java (#13783) @gerashegalov
Update to Cython 3.0.0 (#13777) @vyasr
Refactor Parquet reader handling of V2 page header info (#13775) @etseidl
Branch 23.10 merge 23.08 (#13773) @vyasr
Restructure JSON code to correctly reflect legacy/experimental status (#13757) @vuule
Branch 23.10 merge 23.08 (#13753) @vyasr
Enforce deprecations in 23.10 (#13732) @galipremsagar
Upgrade to arrow 12 (#13728) @galipremsagar
Refactors JSON reader's pushdown automaton (#13716) @elstehle
Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule

- C++
Published by rapids-bot[bot] almost 3 years ago

https://github.com/rapidsai/cudf - [NIGHTLY] v23.06.00

🔗 Links

🚨 Breaking Changes

Fix batch processing for parquet writer (#13438) @ttnghia
Use <NA> instead of null to match pandas. (#13415) @bdice
Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
Remove null mask and null count from column_view constructors (#13311) @vyasr
Change default value of the observed= argument in groupby to True to reflect the actual behaviour (#13296) @shwina
Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
Update minimum Python version to Python 3.9 (#13196) @shwina
Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
Cleanup Parquet chunked writer (#13094) @ttnghia
Cleanup ORC chunked writer (#13091) @ttnghia
Raise NotImplementedError when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
Remove deprecated regex functions from libcudf (#13067) @davidwendt
[REVIEW] Upgrade to arrow-11 (#12757) @galipremsagar
Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller

🐛 Bug Fixes

Fix valid count computation in offset_bitmask_binop kernel (#13489) @davidwendt
Fix writing of ORC files with empty rowgroups (#13466) @vuule
Fix cudf::repeat logic when count is zero (#13459) @davidwendt
Fix batch processing for parquet writer (#13438) @ttnghia
Fix invalid use of std::exclusive_scan in Parquet writer (#13434) @etseidl
Patch numba if it is imported first to ensure minor version compatibility works. (#13433) @bdice
Fix cudf::strings::replace_with_backrefs hang on empty match result (#13418) @davidwendt
Use <NA> instead of null to match pandas. (#13415) @bdice
Fix tokenize with non-space delimiter (#13403) @shwina
Fix groupby head/tail for empty dataframe (#13398) @shwina
Default to closed="right" in IntervalIndex constructor (#13394) @shwina
Correctly reorder and reindex scan groupbys with null keys (#13389) @wence-
Fix unused argument errors in nvcc 11.5 (#13387) @abellina
Updates needed to work with jitify that leverages libcudacxx (#13383) @robertmaynard
Fix unused parameter warning/error in parquet/page_data.cu (#13367) @davidwendt
Fix page size estimation in Parquet writer (#13364) @etseidl
Fix subword_tokenize error when input contains no tokens (#13320) @davidwendt
Support gcc 12 as the C++ compiler (#13316) @robertmaynard
Correctly set bitmask size in from_column_view (#13315) @wence-
Fix approach to detecting assignment for gte/lte operators (#13285) @vyasr
Fix parquet schema interpretation issue (#13277) @hyperbolic2346
Fix 64bit shift bug in avro reader (#13276) @karthikeyann
Fix unused variables/parameters in parquet/writer_impl.cu (#13263) @davidwendt
Clean up buffers in case AssertionError (#13262) @razajafri
Allow empty input table in ast compute_column (#13245) @wence-
Fix structs_column_wrapper constructors to copy input column wrappers (#13243) @davidwendt
Fix the row index stream order in ORC reader (#13242) @vuule
Make is_decompression_disabled and is_compression_disabled thread-safe (#13240) @vuule
Add [[maybe_unused]] to nvbench environment. (#13219) @bdice
Fix race in ORC string dictionary creation (#13214) @revans2
Add scalar argtypes to udf cache keys (#13194) @brandon-b-miller
Fix unused parameter warning/error in grouped_rolling.cu (#13192) @davidwendt
Avoid skbuild 0.17.2 which affected the cmake -DPython_LIBRARY string (#13188) @sevagh
Fix hostdevice_vector::subspan (#13187) @ttnghia
Use custom nvbench entry point to ensure cudf::nvbench_base_fixture usage (#13183) @robertmaynard
Fix slice_strings to return empty strings for stop < start indices (#13178) @davidwendt
Allow compilation with any GTest version 1.11+ (#13153) @robertmaynard
Fix a few clang-format style check errors (#13146) @davidwendt
[REVIEW] Fix Series and DataFrame constructors to validate index lengths (#13122) @galipremsagar
Fix hash join when the input tables have nulls on only one side (#13120) @ttnghia
Fix GPU_ARCHS setting in Java CMake build and CMAKE_CUDA_ARCHITECTURES in Python package build. (#13117) @davidwendt
Adds checks to make sure json reader won't overflow (#13115) @elstehle
Fix null_count of columns returned by chunked_parquet_reader (#13111) @vuule
Fixes sliced list and struct column bug in JSON chunked writer (#13108) @karthikeyann
[REVIEW] Fix missing confluent kafka version (#13101) @galipremsagar
Use make_empty_lists_column instead of make_empty_column(type_id::LIST) (#13099) @davidwendt
Raise NotImplementedError when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
Fix column selection read_parquet benchmarks (#13082) @vuule
Fix bugs in iterative groupby apply algorithm (#13078) @brandon-b-miller
Add algorithm include in data_sink.hpp (#13068) @ahendriksen
Fix tests/identify_stream_usage.cpp (#13066) @ahendriksen
Prevent overflow with skip_rows in ORC and Parquet readers (#13063) @vuule
Add except declaration in Cython interface for regex_program::create (#13054) @davidwendt
[REVIEW] Fix branch version in CI scripts (#13029) @galipremsagar
Fix OOB memory access in CSV reader when reading without NA values (#13011) @vuule
Fix read_avro() skip_rows and num_rows. (#12912) @tpn
Purge nonempty nulls from byte_cast list outputs. (#11971) @bdice
Fix consumption of CPU-backed interchange protocol dataframes (#11392) @shwina

🚀 New Features

Remove numba JIT kernel usage from dataframe copy tests (#13385) @brandon-b-miller
Add JNI for ORC/Parquet writer compression statistics (#13376) @ttnghia
Use compile_or_get in JIT groupby apply (#13350) @brandon-b-miller
cuDF numba cuda 12 updates (#13337) @brandon-b-miller
Add tz_convert method to convert between timestamps (#13328) @shwina
Optionally return compression statistics from ORC and Parquet writers (#13294) @vuule
Support the case=False argument to str.contains (#13290) @shwina
Add an event handler for ColumnVector.close (#13279) @abellina
JNI api for cudf::chunked_pack (#13278) @abellina
Implement a chunked_pack API (#13260) @abellina
Update cudf recipes to use GTest version to >=1.13 (#13207) @robertmaynard
JNI changes for range-extents in window functions. (#13199) @mythrocks
Add support for DatetimeTZDtype and tz_localize (#13163) @shwina
Add IS_NULL operator to AST (#13145) @karthikeyann
STRING order-by column for RANGE window functions (#13143) @mythrocks
Update contains_table to experimental row hasher and equality comparator (#13119) @divyegala
Automatically select GroupBy.apply algorithm based on if the UDF is jittable (#13113) @brandon-b-miller
Refactor Parquet chunked writer (#13076) @ttnghia
Add Python bindings for string literal support in AST (#13073) @karthikeyann
Add Java bindings for string literal support in AST (#13072) @karthikeyann
Add string scalar support in AST (#13061) @karthikeyann
Log cuIO warnings using the libcudf logger (#13043) @vuule
Update mixed_join to use experimental row hasher and comparator (#13028) @divyegala
Support structs of lists in row lexicographic comparator (#13005) @ttnghia
Adding hostdevice_span that is a span createable from hostdevice_vector (#12981) @hyperbolic2346
Add nvtext::minhash function (#12961) @davidwendt
Support lists of structs in row lexicographic comparator (#12953) @ttnghia
Update join to use experimental row hasher and comparator (#12787) @divyegala
Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller

🛠️ Improvements

Bump typing_extensions minimum version to 4.0.0 (#13618) @shwina
Drop extraneous dependencies from cudf conda recipe. (#13406) @bdice
Handle some corner-cases in indexing with boolean masks (#13402) @wence-
Add cudf::stable_distinct public API, tests, and benchmarks. (#13392) @bdice
[JNI] Pass this ColumnVector to the onClosed event handler (#13386) @abellina
Fix JNI method with mismatched parameter list (#13384) @ttnghia
Split up experimental_row_operator_tests.cu to improve its compile time (#13382) @davidwendt
Deprecate cudf::strings::slice_strings APIs that accept delimiters (#13373) @davidwendt
Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
Move some nvtext benchmarks to nvbench (#13368) @davidwendt
run docs nightly too (#13366) @AyodeAwe
Add warning for default dtype parameter in get_dummies (#13365) @galipremsagar
Add log messages about kvikIO compatibility mode (#13363) @vuule
Switch back to using primary shared-action-workflows branch (#13362) @vyasr
Deprecate StringIndex and use Index instead (#13361) @galipremsagar
Ensure columns have valid null counts in CUDF JNI. (#13355) @mythrocks
Expunge most uses of TypeVar(bound="Foo") (#13346) @wence-
Remove all references to UNKNOWN_NULL_COUNT in Python (#13345) @vyasr
Improve distinct_count with cuco::static_set (#13343) @PointKernel
Fix contiguous_split performance (#13342) @ttnghia
Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
Update mypy to 1.3 (#13340) @wence-
[Java] Purge non-empty nulls when setting validity (#13335) @razajafri
Add row-wise filtering step to read_parquet (#13334) @rjzamora
Performance improvement for nvtext::minhash (#13333) @davidwendt
Fix some libcudf functions to set the null count on returning columns (#13331) @davidwendt
Change cudf::detail::concatenate_masks to return null-count (#13330) @davidwendt
Move meta calculation in dask_cudf.read_parquet (#13327) @rjzamora
Changes to support Numpy >= 1.24 (#13325) @shwina
Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
Clean up distinct_count benchmark (#13321) @PointKernel
Fix gtest pinning to 1.13.0. (#13319) @bdice
Remove null mask and null count from column_view constructors (#13311) @vyasr
Address feedback from 13289 (#13306) @vyasr
Change default value of the observed= argument in groupby to True to reflect the actual behaviour (#13296) @shwina
First check for BaseDtype when inferring the data type of an arbitrary object (#13295) @shwina
Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
Support CUDA 12.0 for pip wheels (#13289) @divyegala
Refactor transform_lists_of_structs in row_operators.cu (#13288) @ttnghia
Branch 23.06 merge 23.04 (#13286) @vyasr
Update cupy dependency (#13284) @vyasr
Performance improvement in cudf::strings::join_strings for long strings (#13283) @davidwendt
Fix unused variables and functions (#13275) @karthikeyann
Fix integer overflow in partition scatter_map construction (#13272) @wence-
Numba 0.57 compatibility fixes (#13271) @gmarkall
Performance improvement in cudf::strings::all_characters_of_type (#13259) @davidwendt
Remove default null-count parameter from some libcudf factory functions (#13258) @davidwendt
Roll our own generate_string() because mimesis' has gone away (#13257) @shwina
Build wheels using new single image workflow (#13249) @vyasr
Enable sccache hits from local builds (#13248) @AyodeAwe
Revert to branch-23.06 for shared-action-workflows (#13247) @shwina
Introduce pandas_compatible option in cudf (#13241) @galipremsagar
Add metadata_builder helper class (#13232) @abellina
Use libkvikio conda packages in libcudf, add explicit libcufile dependency. (#13231) @bdice
Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
Performance improvement in cudf::strings::find/rfind for long strings (#13226) @davidwendt
Add chunked reader benchmark (#13223) @SrikarVanavasam
Set the null count in output columns in the CSV reader (#13221) @vuule
Skip Non-Empty nulls tests for the nightly build just like we skip CuFileTest and CudaFatalTest (#13213) @razajafri
Fix string_scalar stream usage in write_json.cu (#13212) @davidwendt
Use canonicalized name for dlopen'd libraries (libcufile) (#13210) @shwina
Refactor pinned memory vector and ORC+Parquet writers (#13206) @ttnghia
Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
Optimization to decoding of parquet level streams (#13203) @nvdbaranec
Clean up and simplify gpuDecideCompression (#13202) @vuule
Use std::array for a statically sized vector in create_serialized_trie (#13201) @vuule
Update minimum Python version to Python 3.9 (#13196) @shwina
Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
Remove usage of rapids-get-rapids-version-from-git (#13184) @jjacobelli
Enable mixed-dtype decimal/scalar binary operations (#13171) @shwina
Split up unique_count.cu to improve build time (#13169) @davidwendt
Use nvtx3 includes in string examples. (#13165) @bdice
Change some .cu gtest files to .cpp (#13155) @davidwendt
Remove wheel pytest verbosity (#13151) @sevagh
Fix libcudf to always pass null-count to set_null_mask (#13149) @davidwendt
Fix gtests to always pass null-count to set_null_mask calls (#13148) @davidwendt
Optimize JSON writer (#13144) @karthikeyann
Performance improvement for libcudf upper/lower conversion for long strings (#13142) @davidwendt
[REVIEW] Deprecate pad and backfill methods (#13140) @galipremsagar
Use CTAD instead of functions in ProtobufReader (#13135) @vuule
Remove more instances of UNKNOWN_NULL_COUNT (#13134) @vyasr
Update clang-format to 16.0.1. (#13133) @bdice
Add log messages about cuIO's nvCOMP and cuFile use (#13132) @vuule
Branch 23.06 merge 23.04 (#13131) @vyasr
Compute null-count in cudf::detail::slice (#13124) @davidwendt
Use ARC V2 self-hosted runners for GPU jobs (#13123) @jjacobelli
Set null-count in linked_column_view conversion operator (#13121) @davidwendt
Adding ifdefs around nvcc-specific pragmas (#13110) @hyperbolic2346
Add null-count parameter to json experimental parse_data utility (#13107) @davidwendt
Remove uses-setup-env-vars (#13105) @vyasr
Explicitly compute null count in concatenate APIs (#13104) @vyasr
Replace unnecessary uses of UNKNOWN_NULL_COUNT (#13102) @vyasr
Performance improvement for cudf::string_view::find functions (#13100) @davidwendt
Use .element() instead of .data() for window range calculations (#13095) @mythrocks
Cleanup Parquet chunked writer (#13094) @ttnghia
Fix unused variable error/warning in page_data.cu (#13093) @davidwendt
Cleanup ORC chunked writer (#13091) @ttnghia
Remove using namespace cudf; from libcudf gtests source (#13089) @davidwendt
Change cudf::test::make_null_mask to also return null-count (#13081) @davidwendt
Resolved automerger from branch-23.04 to branch-23.06 (#13080) @galipremsagar
Assert for non-empty nulls (#13071) @razajafri
Remove deprecated regex functions from libcudf (#13067) @davidwendt
Refactor cudf::detail::sorted_order (#13062) @ttnghia
Improve performance of slice_strings for long strings (#13057) @davidwendt
Reduce shared memory usage in gpuComputePageSizes by 50% (#13047) @nvdbaranec
[REVIEW] Add notes to performance comparisons notebook (#13044) @galipremsagar
Enable binary operations between scalars and columns of differing decimal types (#13034) @shwina
Remove console output from some libcudf gtests (#13027) @davidwendt
Remove underscore in build string. (#13025) @bdice
Bump up JNI version 23.06.0-SNAPSHOT (#13021) @pxLi
Fix auto merger from branch-23.04 to branch-23.06 (#13009) @galipremsagar
Reduce peak memory use when writing compressed ORC files. (#12963) @vuule
Add nvtx annotations to groupby methods (#12941) @wence-
Compute column sizes in Parquet preprocess with single kernel (#12931) @SrikarVanavasam
Add Python bindings for time zone data (TZiF) reader (#12826) @shwina
Optimize set-like operations (#12769) @ttnghia
[REVIEW] Upgrade to arrow-11 (#12757) @galipremsagar
Add empty test files for test reorganization (#12288) @shwina

- C++
Published by rapids-bot[bot] almost 3 years ago

https://github.com/rapidsai/cudf - v23.06.01

🚨 Breaking Changes

Fix batch processing for parquet writer (#13438) @ttnghia
Use <NA> instead of null to match pandas. (#13415) @bdice
Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
Remove null mask and null count from column_view constructors (#13311) @vyasr
Change default value of the observed= argument in groupby to True to reflect the actual behaviour (#13296) @shwina
Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
Update minimum Python version to Python 3.9 (#13196) @shwina
Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
Cleanup Parquet chunked writer (#13094) @ttnghia
Cleanup ORC chunked writer (#13091) @ttnghia
Raise NotImplementedError when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
Remove deprecated regex functions from libcudf (#13067) @davidwendt
[REVIEW] Upgrade to arrow-11 (#12757) @galipremsagar
Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller

🐛 Bug Fixes

Fix valid count computation in offset_bitmask_binop kernel (#13489) @davidwendt
Fix writing of ORC files with empty rowgroups (#13466) @vuule
Fix cudf::repeat logic when count is zero (#13459) @davidwendt
Fix batch processing for parquet writer (#13438) @ttnghia
Fix invalid use of std::exclusive_scan in Parquet writer (#13434) @etseidl
Patch numba if it is imported first to ensure minor version compatibility works. (#13433) @bdice
Fix cudf::strings::replace_with_backrefs hang on empty match result (#13418) @davidwendt
Use <NA> instead of null to match pandas. (#13415) @bdice
Fix tokenize with non-space delimiter (#13403) @shwina
Fix groupby head/tail for empty dataframe (#13398) @shwina
Default to closed="right" in IntervalIndex constructor (#13394) @shwina
Correctly reorder and reindex scan groupbys with null keys (#13389) @wence-
Fix unused argument errors in nvcc 11.5 (#13387) @abellina
Updates needed to work with jitify that leverages libcudacxx (#13383) @robertmaynard
Fix unused parameter warning/error in parquet/page_data.cu (#13367) @davidwendt
Fix page size estimation in Parquet writer (#13364) @etseidl
Fix subword_tokenize error when input contains no tokens (#13320) @davidwendt
Support gcc 12 as the C++ compiler (#13316) @robertmaynard
Correctly set bitmask size in from_column_view (#13315) @wence-
Fix approach to detecting assignment for gte/lte operators (#13285) @vyasr
Fix parquet schema interpretation issue (#13277) @hyperbolic2346
Fix 64bit shift bug in avro reader (#13276) @karthikeyann
Fix unused variables/parameters in parquet/writer_impl.cu (#13263) @davidwendt
Clean up buffers in case AssertionError (#13262) @razajafri
Allow empty input table in ast compute_column (#13245) @wence-
Fix structs_column_wrapper constructors to copy input column wrappers (#13243) @davidwendt
Fix the row index stream order in ORC reader (#13242) @vuule
Make is_decompression_disabled and is_compression_disabled thread-safe (#13240) @vuule
Add [[maybe_unused]] to nvbench environment. (#13219) @bdice
Fix race in ORC string dictionary creation (#13214) @revans2
Add scalar argtypes to udf cache keys (#13194) @brandon-b-miller
Fix unused parameter warning/error in grouped_rolling.cu (#13192) @davidwendt
Avoid skbuild 0.17.2 which affected the cmake -DPython_LIBRARY string (#13188) @sevagh
Fix hostdevice_vector::subspan (#13187) @ttnghia
Use custom nvbench entry point to ensure cudf::nvbench_base_fixture usage (#13183) @robertmaynard
Fix slice_strings to return empty strings for stop < start indices (#13178) @davidwendt
Allow compilation with any GTest version 1.11+ (#13153) @robertmaynard
Fix a few clang-format style check errors (#13146) @davidwendt
[REVIEW] Fix Series and DataFrame constructors to validate index lengths (#13122) @galipremsagar
Fix hash join when the input tables have nulls on only one side (#13120) @ttnghia
Fix GPU_ARCHS setting in Java CMake build and CMAKE_CUDA_ARCHITECTURES in Python package build. (#13117) @davidwendt
Adds checks to make sure json reader won't overflow (#13115) @elstehle
Fix null_count of columns returned by chunked_parquet_reader (#13111) @vuule
Fixes sliced list and struct column bug in JSON chunked writer (#13108) @karthikeyann
[REVIEW] Fix missing confluent kafka version (#13101) @galipremsagar
Use make_empty_lists_column instead of make_empty_column(type_id::LIST) (#13099) @davidwendt
Raise NotImplementedError when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
Fix column selection read_parquet benchmarks (#13082) @vuule
Fix bugs in iterative groupby apply algorithm (#13078) @brandon-b-miller
Add algorithm include in data_sink.hpp (#13068) @ahendriksen
Fix tests/identify_stream_usage.cpp (#13066) @ahendriksen
Prevent overflow with skip_rows in ORC and Parquet readers (#13063) @vuule
Add except declaration in Cython interface for regex_program::create (#13054) @davidwendt
[REVIEW] Fix branch version in CI scripts (#13029) @galipremsagar
Fix OOB memory access in CSV reader when reading without NA values (#13011) @vuule
Fix read_avro() skip_rows and num_rows. (#12912) @tpn
Purge nonempty nulls from byte_cast list outputs. (#11971) @bdice
Fix consumption of CPU-backed interchange protocol dataframes (#11392) @shwina

🚀 New Features

Remove numba JIT kernel usage from dataframe copy tests (#13385) @brandon-b-miller
Add JNI for ORC/Parquet writer compression statistics (#13376) @ttnghia
Use compile_or_get in JIT groupby apply (#13350) @brandon-b-miller
cuDF numba cuda 12 updates (#13337) @brandon-b-miller
Add tz_convert method to convert between timestamps (#13328) @shwina
Optionally return compression statistics from ORC and Parquet writers (#13294) @vuule
Support the case=False argument to str.contains (#13290) @shwina
Add an event handler for ColumnVector.close (#13279) @abellina
JNI api for cudf::chunked_pack (#13278) @abellina
Implement a chunked_pack API (#13260) @abellina
Update cudf recipes to use GTest version >=1.13 (#13207) @robertmaynard
JNI changes for range-extents in window functions. (#13199) @mythrocks
Add support for DatetimeTZDtype and tz_localize (#13163) @shwina
Add IS_NULL operator to AST (#13145) @karthikeyann
STRING order-by column for RANGE window functions (#13143) @mythrocks
Update contains_table to experimental row hasher and equality comparator (#13119) @divyegala
Automatically select GroupBy.apply algorithm based on if the UDF is jittable (#13113) @brandon-b-miller
Refactor Parquet chunked writer (#13076) @ttnghia
Add Python bindings for string literal support in AST (#13073) @karthikeyann
Add Java bindings for string literal support in AST (#13072) @karthikeyann
Add string scalar support in AST (#13061) @karthikeyann
Log cuIO warnings using the libcudf logger (#13043) @vuule
Update mixed_join to use experimental row hasher and comparator (#13028) @divyegala
Support structs of lists in row lexicographic comparator (#13005) @ttnghia
Adding hostdevice_span that is a span creatable from hostdevice_vector (#12981) @hyperbolic2346
Add nvtext::minhash function (#12961) @davidwendt
Support lists of structs in row lexicographic comparator (#12953) @ttnghia
Update join to use experimental row hasher and comparator (#12787) @divyegala
Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller

🛠️ Improvements

Bump typing_extensions minimum version to 4.0.0 (#13618) @shwina
Drop extraneous dependencies from cudf conda recipe. (#13406) @bdice
Handle some corner-cases in indexing with boolean masks (#13402) @wence-
Add cudf::stable_distinct public API, tests, and benchmarks. (#13392) @bdice
[JNI] Pass this ColumnVector to the onClosed event handler (#13386) @abellina
Fix JNI method with mismatched parameter list (#13384) @ttnghia
Split up experimental_row_operator_tests.cu to improve its compile time (#13382) @davidwendt
Deprecate cudf::strings::slice_strings APIs that accept delimiters (#13373) @davidwendt
Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
Move some nvtext benchmarks to nvbench (#13368) @davidwendt
run docs nightly too (#13366) @AyodeAwe
Add warning for default dtype parameter in get_dummies (#13365) @galipremsagar
Add log messages about kvikIO compatibility mode (#13363) @vuule
Switch back to using primary shared-action-workflows branch (#13362) @vyasr
Deprecate StringIndex and use Index instead (#13361) @galipremsagar
Ensure columns have valid null counts in CUDF JNI. (#13355) @mythrocks
Expunge most uses of TypeVar(bound="Foo") (#13346) @wence-
Remove all references to UNKNOWN_NULL_COUNT in Python (#13345) @vyasr
Improve distinct_count with cuco::static_set (#13343) @PointKernel
Fix contiguous_split performance (#13342) @ttnghia
Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
Update mypy to 1.3 (#13340) @wence-
[Java] Purge non-empty nulls when setting validity (#13335) @razajafri
Add row-wise filtering step to read_parquet (#13334) @rjzamora
Performance improvement for nvtext::minhash (#13333) @davidwendt
Fix some libcudf functions to set the null count on returning columns (#13331) @davidwendt
Change cudf::detail::concatenate_masks to return null-count (#13330) @davidwendt
Move meta calculation in dask_cudf.read_parquet (#13327) @rjzamora
Changes to support Numpy >= 1.24 (#13325) @shwina
Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
Clean up distinct_count benchmark (#13321) @PointKernel
Fix gtest pinning to 1.13.0. (#13319) @bdice
Remove null mask and null count from column_view constructors (#13311) @vyasr
Address feedback from 13289 (#13306) @vyasr
Change default value of the observed= argument in groupby to True to reflect the actual behaviour (#13296) @shwina
First check for BaseDtype when inferring the data type of an arbitrary object (#13295) @shwina
Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
Support CUDA 12.0 for pip wheels (#13289) @divyegala
Refactor transform_lists_of_structs in row_operators.cu (#13288) @ttnghia
Branch 23.06 merge 23.04 (#13286) @vyasr
Update cupy dependency (#13284) @vyasr
Performance improvement in cudf::strings::join_strings for long strings (#13283) @davidwendt
Fix unused variables and functions (#13275) @karthikeyann
Fix integer overflow in partition scatter_map construction (#13272) @wence-
Numba 0.57 compatibility fixes (#13271) @gmarkall
Performance improvement in cudf::strings::all_characters_of_type (#13259) @davidwendt
Remove default null-count parameter from some libcudf factory functions (#13258) @davidwendt
Roll our own generate_string() because mimesis' has gone away (#13257) @shwina
Build wheels using new single image workflow (#13249) @vyasr
Enable sccache hits from local builds (#13248) @AyodeAwe
Revert to branch-23.06 for shared-action-workflows (#13247) @shwina
Introduce pandas_compatible option in cudf (#13241) @galipremsagar
Add metadata_builder helper class (#13232) @abellina
Use libkvikio conda packages in libcudf, add explicit libcufile dependency. (#13231) @bdice
Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
Performance improvement in cudf::strings::find/rfind for long strings (#13226) @davidwendt
Add chunked reader benchmark (#13223) @SrikarVanavasam
Set the null count in output columns in the CSV reader (#13221) @vuule
Skip Non-Empty nulls tests for the nightly build just like we skip CuFileTest and CudaFatalTest (#13213) @razajafri
Fix string_scalar stream usage in write_json.cu (#13212) @davidwendt
Use canonicalized name for dlopen'd libraries (libcufile) (#13210) @shwina
Refactor pinned memory vector and ORC+Parquet writers (#13206) @ttnghia
Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
Optimization to decoding of parquet level streams (#13203) @nvdbaranec
Clean up and simplify gpuDecideCompression (#13202) @vuule
Use std::array for a statically sized vector in create_serialized_trie (#13201) @vuule
Update minimum Python version to Python 3.9 (#13196) @shwina
Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
Remove usage of rapids-get-rapids-version-from-git (#13184) @jjacobelli
Enable mixed-dtype decimal/scalar binary operations (#13171) @shwina
Split up unique_count.cu to improve build time (#13169) @davidwendt
Use nvtx3 includes in string examples. (#13165) @bdice
Change some .cu gtest files to .cpp (#13155) @davidwendt
Remove wheel pytest verbosity (#13151) @sevagh
Fix libcudf to always pass null-count to set_null_mask (#13149) @davidwendt
Fix gtests to always pass null-count to set_null_mask calls (#13148) @davidwendt
Optimize JSON writer (#13144) @karthikeyann
Performance improvement for libcudf upper/lower conversion for long strings (#13142) @davidwendt
[REVIEW] Deprecate pad and backfill methods (#13140) @galipremsagar
Use CTAD instead of functions in ProtobufReader (#13135) @vuule
Remove more instances of UNKNOWN_NULL_COUNT (#13134) @vyasr
Update clang-format to 16.0.1. (#13133) @bdice
Add log messages about cuIO's nvCOMP and cuFile use (#13132) @vuule
Branch 23.06 merge 23.04 (#13131) @vyasr
Compute null-count in cudf::detail::slice (#13124) @davidwendt
Use ARC V2 self-hosted runners for GPU jobs (#13123) @jjacobelli
Set null-count in linked_column_view conversion operator (#13121) @davidwendt
Adding ifdefs around nvcc-specific pragmas (#13110) @hyperbolic2346
Add null-count parameter to json experimental parse_data utility (#13107) @davidwendt
Remove uses-setup-env-vars (#13105) @vyasr
Explicitly compute null count in concatenate APIs (#13104) @vyasr
Replace unnecessary uses of UNKNOWN_NULL_COUNT (#13102) @vyasr
Performance improvement for cudf::string_view::find functions (#13100) @davidwendt
Use .element() instead of .data() for window range calculations (#13095) @mythrocks
Cleanup Parquet chunked writer (#13094) @ttnghia
Fix unused variable error/warning in page_data.cu (#13093) @davidwendt
Cleanup ORC chunked writer (#13091) @ttnghia
Remove using namespace cudf; from libcudf gtests source (#13089) @davidwendt
Change cudf::test::make_null_mask to also return null-count (#13081) @davidwendt
Resolved automerger from branch-23.04 to branch-23.06 (#13080) @galipremsagar
Assert for non-empty nulls (#13071) @razajafri
Remove deprecated regex functions from libcudf (#13067) @davidwendt
Refactor cudf::detail::sorted_order (#13062) @ttnghia
Improve performance of slice_strings for long strings (#13057) @davidwendt
Reduce shared memory usage in gpuComputePageSizes by 50% (#13047) @nvdbaranec
[REVIEW] Add notes to performance comparisons notebook (#13044) @galipremsagar
Enable binary operations between scalars and columns of differing decimal types (#13034) @shwina
Remove console output from some libcudf gtests (#13027) @davidwendt
Remove underscore in build string. (#13025) @bdice
Bump up JNI version 23.06.0-SNAPSHOT (#13021) @pxLi
Fix auto merger from branch-23.04 to branch-23.06 (#13009) @galipremsagar
Reduce peak memory use when writing compressed ORC files. (#12963) @vuule
Add nvtx annotations to groupby methods (#12941) @wence-
Compute column sizes in Parquet preprocess with single kernel (#12931) @SrikarVanavasam
Add Python bindings for time zone data (TZiF) reader (#12826) @shwina
Optimize set-like operations (#12769) @ttnghia
[REVIEW] Upgrade to arrow-11 (#12757) @galipremsagar
Add empty test files for test reorganization (#12288) @shwina

- C++
Published by raydouglass almost 3 years ago

https://github.com/rapidsai/cudf - v23.06.00

🚨 Breaking Changes

Fix batch processing for parquet writer (#13438) @ttnghia
Use <NA> instead of null to match pandas. (#13415) @bdice
Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
Remove null mask and null count from column_view constructors (#13311) @vyasr
Change default value of the observed= argument in groupby to True to reflect the actual behaviour (#13296) @shwina
Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
Update minimum Python version to Python 3.9 (#13196) @shwina
Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
Cleanup Parquet chunked writer (#13094) @ttnghia
Cleanup ORC chunked writer (#13091) @ttnghia
Raise NotImplementedError when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
Remove deprecated regex functions from libcudf (#13067) @davidwendt
[REVIEW] Upgrade to arrow-11 (#12757) @galipremsagar
Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller

🐛 Bug Fixes

Fix valid count computation in offset_bitmask_binop kernel (#13489) @davidwendt
Fix writing of ORC files with empty rowgroups (#13466) @vuule
Fix cudf::repeat logic when count is zero (#13459) @davidwendt
Fix batch processing for parquet writer (#13438) @ttnghia
Fix invalid use of std::exclusive_scan in Parquet writer (#13434) @etseidl
Patch numba if it is imported first to ensure minor version compatibility works. (#13433) @bdice
Fix cudf::strings::replace_with_backrefs hang on empty match result (#13418) @davidwendt
Use <NA> instead of null to match pandas. (#13415) @bdice
Fix tokenize with non-space delimiter (#13403) @shwina
Fix groupby head/tail for empty dataframe (#13398) @shwina
Default to closed="right" in IntervalIndex constructor (#13394) @shwina
Correctly reorder and reindex scan groupbys with null keys (#13389) @wence-
Fix unused argument errors in nvcc 11.5 (#13387) @abellina
Updates needed to work with jitify that leverages libcudacxx (#13383) @robertmaynard
Fix unused parameter warning/error in parquet/page_data.cu (#13367) @davidwendt
Fix page size estimation in Parquet writer (#13364) @etseidl
Fix subword_tokenize error when input contains no tokens (#13320) @davidwendt
Support gcc 12 as the C++ compiler (#13316) @robertmaynard
Correctly set bitmask size in from_column_view (#13315) @wence-
Fix approach to detecting assignment for gte/lte operators (#13285) @vyasr
Fix parquet schema interpretation issue (#13277) @hyperbolic2346
Fix 64bit shift bug in avro reader (#13276) @karthikeyann
Fix unused variables/parameters in parquet/writer_impl.cu (#13263) @davidwendt
Clean up buffers in case AssertionError (#13262) @razajafri
Allow empty input table in ast compute_column (#13245) @wence-
Fix structs_column_wrapper constructors to copy input column wrappers (#13243) @davidwendt
Fix the row index stream order in ORC reader (#13242) @vuule
Make is_decompression_disabled and is_compression_disabled thread-safe (#13240) @vuule
Add [[maybe_unused]] to nvbench environment. (#13219) @bdice
Fix race in ORC string dictionary creation (#13214) @revans2
Add scalar argtypes to udf cache keys (#13194) @brandon-b-miller
Fix unused parameter warning/error in grouped_rolling.cu (#13192) @davidwendt
Avoid skbuild 0.17.2 which affected the cmake -DPython_LIBRARY string (#13188) @sevagh
Fix hostdevice_vector::subspan (#13187) @ttnghia
Use custom nvbench entry point to ensure cudf::nvbench_base_fixture usage (#13183) @robertmaynard
Fix slice_strings to return empty strings for stop < start indices (#13178) @davidwendt
Allow compilation with any GTest version 1.11+ (#13153) @robertmaynard
Fix a few clang-format style check errors (#13146) @davidwendt
[REVIEW] Fix Series and DataFrame constructors to validate index lengths (#13122) @galipremsagar
Fix hash join when the input tables have nulls on only one side (#13120) @ttnghia
Fix GPU_ARCHS setting in Java CMake build and CMAKE_CUDA_ARCHITECTURES in Python package build. (#13117) @davidwendt
Adds checks to make sure json reader won't overflow (#13115) @elstehle
Fix null_count of columns returned by chunked_parquet_reader (#13111) @vuule
Fixes sliced list and struct column bug in JSON chunked writer (#13108) @karthikeyann
[REVIEW] Fix missing confluent kafka version (#13101) @galipremsagar
Use make_empty_lists_column instead of make_empty_column(type_id::LIST) (#13099) @davidwendt
Raise NotImplementedError when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
Fix column selection read_parquet benchmarks (#13082) @vuule
Fix bugs in iterative groupby apply algorithm (#13078) @brandon-b-miller
Add algorithm include in data_sink.hpp (#13068) @ahendriksen
Fix tests/identify_stream_usage.cpp (#13066) @ahendriksen
Prevent overflow with skip_rows in ORC and Parquet readers (#13063) @vuule
Add except declaration in Cython interface for regex_program::create (#13054) @davidwendt
[REVIEW] Fix branch version in CI scripts (#13029) @galipremsagar
Fix OOB memory access in CSV reader when reading without NA values (#13011) @vuule
Fix read_avro() skip_rows and num_rows. (#12912) @tpn
Purge nonempty nulls from byte_cast list outputs. (#11971) @bdice
Fix consumption of CPU-backed interchange protocol dataframes (#11392) @shwina

🚀 New Features

Remove numba JIT kernel usage from dataframe copy tests (#13385) @brandon-b-miller
Add JNI for ORC/Parquet writer compression statistics (#13376) @ttnghia
Use compile_or_get in JIT groupby apply (#13350) @brandon-b-miller
cuDF numba cuda 12 updates (#13337) @brandon-b-miller
Add tz_convert method to convert between timestamps (#13328) @shwina
Optionally return compression statistics from ORC and Parquet writers (#13294) @vuule
Support the case=False argument to str.contains (#13290) @shwina
Add an event handler for ColumnVector.close (#13279) @abellina
JNI api for cudf::chunked_pack (#13278) @abellina
Implement a chunked_pack API (#13260) @abellina
Update cudf recipes to use GTest version >=1.13 (#13207) @robertmaynard
JNI changes for range-extents in window functions. (#13199) @mythrocks
Add support for DatetimeTZDtype and tz_localize (#13163) @shwina
Add IS_NULL operator to AST (#13145) @karthikeyann
STRING order-by column for RANGE window functions (#13143) @mythrocks
Update contains_table to experimental row hasher and equality comparator (#13119) @divyegala
Automatically select GroupBy.apply algorithm based on if the UDF is jittable (#13113) @brandon-b-miller
Refactor Parquet chunked writer (#13076) @ttnghia
Add Python bindings for string literal support in AST (#13073) @karthikeyann
Add Java bindings for string literal support in AST (#13072) @karthikeyann
Add string scalar support in AST (#13061) @karthikeyann
Log cuIO warnings using the libcudf logger (#13043) @vuule
Update mixed_join to use experimental row hasher and comparator (#13028) @divyegala
Support structs of lists in row lexicographic comparator (#13005) @ttnghia
Adding hostdevice_span that is a span creatable from hostdevice_vector (#12981) @hyperbolic2346
Add nvtext::minhash function (#12961) @davidwendt
Support lists of structs in row lexicographic comparator (#12953) @ttnghia
Update join to use experimental row hasher and comparator (#12787) @divyegala
Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller

🛠️ Improvements

Drop extraneous dependencies from cudf conda recipe. (#13406) @bdice
Handle some corner-cases in indexing with boolean masks (#13402) @wence-
Add cudf::stable_distinct public API, tests, and benchmarks. (#13392) @bdice
[JNI] Pass this ColumnVector to the onClosed event handler (#13386) @abellina
Fix JNI method with mismatched parameter list (#13384) @ttnghia
Split up experimental_row_operator_tests.cu to improve its compile time (#13382) @davidwendt
Deprecate cudf::strings::slice_strings APIs that accept delimiters (#13373) @davidwendt
Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
Move some nvtext benchmarks to nvbench (#13368) @davidwendt
run docs nightly too (#13366) @AyodeAwe
Add warning for default dtype parameter in get_dummies (#13365) @galipremsagar
Add log messages about kvikIO compatibility mode (#13363) @vuule
Switch back to using primary shared-action-workflows branch (#13362) @vyasr
Deprecate StringIndex and use Index instead (#13361) @galipremsagar
Ensure columns have valid null counts in CUDF JNI. (#13355) @mythrocks
Expunge most uses of TypeVar(bound="Foo") (#13346) @wence-
Remove all references to UNKNOWN_NULL_COUNT in Python (#13345) @vyasr
Improve distinct_count with cuco::static_set (#13343) @PointKernel
Fix contiguous_split performance (#13342) @ttnghia
Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
Update mypy to 1.3 (#13340) @wence-
[Java] Purge non-empty nulls when setting validity (#13335) @razajafri
Add row-wise filtering step to read_parquet (#13334) @rjzamora
Performance improvement for nvtext::minhash (#13333) @davidwendt
Fix some libcudf functions to set the null count on returning columns (#13331) @davidwendt
Change cudf::detail::concatenate_masks to return null-count (#13330) @davidwendt
Move meta calculation in dask_cudf.read_parquet (#13327) @rjzamora
Changes to support Numpy >= 1.24 (#13325) @shwina
Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
Clean up distinct_count benchmark (#13321) @PointKernel
Fix gtest pinning to 1.13.0. (#13319) @bdice
Remove null mask and null count from column_view constructors (#13311) @vyasr
Address feedback from 13289 (#13306) @vyasr
Change default value of the observed= argument in groupby to True to reflect the actual behaviour (#13296) @shwina
First check for BaseDtype when inferring the data type of an arbitrary object (#13295) @shwina
Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
Support CUDA 12.0 for pip wheels (#13289) @divyegala
Refactor transform_lists_of_structs in row_operators.cu (#13288) @ttnghia
Branch 23.06 merge 23.04 (#13286) @vyasr
Update cupy dependency (#13284) @vyasr
Performance improvement in cudf::strings::join_strings for long strings (#13283) @davidwendt
Fix unused variables and functions (#13275) @karthikeyann
Fix integer overflow in partition scatter_map construction (#13272) @wence-
Numba 0.57 compatibility fixes (#13271) @gmarkall
Performance improvement in cudf::strings::all_characters_of_type (#13259) @davidwendt
Remove default null-count parameter from some libcudf factory functions (#13258) @davidwendt
Roll our own generate_string() because mimesis' has gone away (#13257) @shwina
Build wheels using new single image workflow (#13249) @vyasr
Enable sccache hits from local builds (#13248) @AyodeAwe
Revert to branch-23.06 for shared-action-workflows (#13247) @shwina
Introduce pandas_compatible option in cudf (#13241) @galipremsagar
Add metadata_builder helper class (#13232) @abellina
Use libkvikio conda packages in libcudf, add explicit libcufile dependency. (#13231) @bdice
Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
Performance improvement in cudf::strings::find/rfind for long strings (#13226) @davidwendt
Add chunked reader benchmark (#13223) @SrikarVanavasam
Set the null count in output columns in the CSV reader (#13221) @vuule
Skip Non-Empty nulls tests for the nightly build just like we skip CuFileTest and CudaFatalTest (#13213) @razajafri
Fix string_scalar stream usage in write_json.cu (#13212) @davidwendt
Use canonicalized name for dlopen'd libraries (libcufile) (#13210) @shwina
Refactor pinned memory vector and ORC+Parquet writers (#13206) @ttnghia
Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
Optimization to decoding of parquet level streams (#13203) @nvdbaranec
Clean up and simplify gpuDecideCompression (#13202) @vuule
Use std::array for a statically sized vector in create_serialized_trie (#13201) @vuule
Update minimum Python version to Python 3.9 (#13196) @shwina
Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
Remove usage of rapids-get-rapids-version-from-git (#13184) @jjacobelli
Enable mixed-dtype decimal/scalar binary operations (#13171) @shwina
Split up unique_count.cu to improve build time (#13169) @davidwendt
Use nvtx3 includes in string examples. (#13165) @bdice
Change some .cu gtest files to .cpp (#13155) @davidwendt
Remove wheel pytest verbosity (#13151) @sevagh
Fix libcudf to always pass null-count to set_null_mask (#13149) @davidwendt
Fix gtests to always pass null-count to set_null_mask calls (#13148) @davidwendt
Optimize JSON writer (#13144) @karthikeyann
Performance improvement for libcudf upper/lower conversion for long strings (#13142) @davidwendt
[REVIEW] Deprecate pad and backfill methods (#13140) @galipremsagar
Use CTAD instead of functions in ProtobufReader (#13135) @vuule
Remove more instances of UNKNOWN_NULL_COUNT (#13134) @vyasr
Update clang-format to 16.0.1. (#13133) @bdice
Add log messages about cuIO's nvCOMP and cuFile use (#13132) @vuule
Branch 23.06 merge 23.04 (#13131) @vyasr
Compute null-count in cudf::detail::slice (#13124) @davidwendt
Use ARC V2 self-hosted runners for GPU jobs (#13123) @jjacobelli
Set null-count in linked_column_view conversion operator (#13121) @davidwendt
Adding ifdefs around nvcc-specific pragmas (#13110) @hyperbolic2346
Add null-count parameter to json experimental parse_data utility (#13107) @davidwendt
Remove uses-setup-env-vars (#13105) @vyasr
Explicitly compute null count in concatenate APIs (#13104) @vyasr
Replace unnecessary uses of UNKNOWN_NULL_COUNT (#13102) @vyasr
Performance improvement for cudf::string_view::find functions (#13100) @davidwendt
Use .element() instead of .data() for window range calculations (#13095) @mythrocks
Cleanup Parquet chunked writer (#13094) @ttnghia
Fix unused variable error/warning in page_data.cu (#13093) @davidwendt
Cleanup ORC chunked writer (#13091) @ttnghia
Remove using namespace cudf; from libcudf gtests source (#13089) @davidwendt
Change cudf::test::make_null_mask to also return null-count (#13081) @davidwendt
Resolved automerger from branch-23.04 to branch-23.06 (#13080) @galipremsagar
Assert for non-empty nulls (#13071) @razajafri
Remove deprecated regex functions from libcudf (#13067) @davidwendt
Refactor cudf::detail::sorted_order (#13062) @ttnghia
Improve performance of slice_strings for long strings (#13057) @davidwendt
Reduce shared memory usage in gpuComputePageSizes by 50% (#13047) @nvdbaranec
[REVIEW] Add notes to performance comparisons notebook (#13044) @galipremsagar
Enable binary operations between scalars and columns of differing decimal types (#13034) @shwina
Remove console output from some libcudf gtests (#13027) @davidwendt
Remove underscore in build string. (#13025) @bdice
Bump up JNI version 23.06.0-SNAPSHOT (#13021) @pxLi
Fix auto merger from branch-23.04 to branch-23.06 (#13009) @galipremsagar
Reduce peak memory use when writing compressed ORC files. (#12963) @vuule
Add nvtx annotations to groupby methods (#12941) @wence-
Compute column sizes in Parquet preprocess with single kernel (#12931) @SrikarVanavasam
Add Python bindings for time zone data (TZiF) reader (#12826) @shwina
Optimize set-like operations (#12769) @ttnghia
[REVIEW] Upgrade to arrow-11 (#12757) @galipremsagar
Add empty test files for test reorganization (#12288) @shwina

Published by raydouglass almost 3 years ago

https://github.com/rapidsai/cudf - v23.04.00

🚨 Breaking Changes

Pin dask and distributed for release (#13070) @galipremsagar
Declare a different name for nan_equality.UNEQUAL to prevent Cython warnings. (#12947) @bdice
Update minimum pandas and numpy pinnings (#12887) @galipremsagar
Deprecate names & dtype in Index.copy (#12825) @galipremsagar
Deprecate Index.is_* methods (#12820) @galipremsagar
Deprecate datetime_is_numeric from describe (#12818) @galipremsagar
Deprecate na_sentinel in factorize (#12817) @galipremsagar
Make string methods return a Series with a useful Index (#12814) @shwina
Produce useful guidance on overflow error in to_csv (#12705) @wence-
Move strings_udf code into cuDF (#12669) @brandon-b-miller
Remove cudf::strings::repeat_strings_output_sizes and optional parameter from cudf::strings::repeat_strings (#12609) @davidwendt
Replace message parsing with throwing more specific exceptions (#12426) @vyasr

🐛 Bug Fixes

Fix memcheck script to execute only _TEST files found in bin/gtests/libcudf (#13006) @davidwendt
Fix DataFrame constructor to broadcast scalar inputs properly (#12997) @galipremsagar
Drop force_nullable_schema from chunked parquet writer (#12996) @galipremsagar
Fix gtest column utility comparator diff reporting (#12995) @davidwendt
Handle index names while performing groupby (#12992) @galipremsagar
Fix __setitem__ on string columns when the scalar value ends in a null byte (#12991) @wence-
Fix sort_values when column is all empty strings (#12988) @eriknw
Remove unused variable and fix memory issue in ORC writer (#12984) @ttnghia
Pre-emptive fix for upstream dask.dataframe.read_parquet changes (#12983) @rjzamora
Remove MANIFEST.in use auto-generated one for sdists and package_data for wheels (#12960) @vyasr
Update to use rapids-export(COMPONENTS) feature. (#12959) @robertmaynard
cudftestutil supports static gtest dependencies (#12957) @robertmaynard
Include gtest in build environment. (#12956) @vyasr
Correctly handle scalar indices in Index.__getitem__ (#12955) @wence-
Avoid building cython twice (#12945) @galipremsagar
Fix set index error for Series rolling window operations (#12942) @galipremsagar
Fix calculation of null counts for Parquet statistics (#12938) @etseidl
Preserve integer dtype of hive-partitioned column containing nulls (#12930) @rjzamora
Use get_current_device_resource for intermediate allocations in COLLECT_LIST window code (#12927) @karthikeyann
Mark dlpack tensor deleter as noexcept to match PyCapsule_Destructor signature. (#12921) @bdice
Fix conda recipe post-link.sh typo (#12916) @pentschev
min_rows and num_rows are swapped in ComputePageSizes declaration in Parquet reader (#12886) @etseidl
Expect cupy to now support bool arrays for dlpack. (#12883) @vyasr
Use python -m pytest for nightly wheel tests (#12871) @bdice
Parquet writer column_size() should return a size_t (#12870) @etseidl
Fix cudf::hash_partition kernel launch error with decimal128 types (#12863) @davidwendt
Fix an issue with parquet chunked reader undercounting string lengths. (#12859) @nvdbaranec
Remove tokenizers pre-install pinning. (#12854) @vyasr
Fix parquet RangeIndex bug (#12838) @rjzamora
Remove KAFKA_HOST_TEST from compute-sanitizer check (#12831) @davidwendt
Make string methods return a Series with a useful Index (#12814) @shwina
Tell cudf_kafka to use header-only fmt (#12796) @vyasr
Add GroupBy.dtypes (#12783) @galipremsagar
Fix a leak in a test and clarify some test names (#12781) @revans2
Fix bug in all-null list due to join_list_elements special handling (#12767) @karthikeyann
Add try/except for expected null-schema error in read_parquet (#12756) @rjzamora
Throw an exception if an unsupported page encoding is detected in Parquet reader (#12754) @etseidl
Fix a bug with num_keys in _scatter_by_slice (#12749) @thomcom
Bump pinned rapids wheel deps to 23.4 (#12735) @sevagh
Rework logic in cudf::strings::split_record to improve performance (#12729) @davidwendt
Add always_nullable flag to Dremel encoding (#12727) @divyegala
Fix memcheck read error in compound segmented reduce (#12722) @davidwendt
Fix faulty conditional logic in JIT GroupBy.apply (#12706) @brandon-b-miller
Produce useful guidance on overflow error in to_csv (#12705) @wence-
Handle parquet list data corner case (#12698) @nvdbaranec
Fix missing trailing comma in json writer (#12688) @karthikeyann
Remove child from newCudaAsyncMemoryResource (#12681) @abellina
Handle bool types in round API (#12670) @galipremsagar
Ensure all of device bitmask is initialized in from_arrow (#12668) @wence-
Fix from_arrow to load a sliced arrow table (#12665) @galipremsagar
Fix dask-cudf read_parquet bug for multi-file aggregation (#12663) @rjzamora
Fix AllocateLikeTest gtests reading uninitialized null-mask (#12643) @davidwendt
Fix find_common_dtype and values to handle complex dtypes (#12537) @galipremsagar
Fix fetching of MultiIndex values when a label is passed (#12521) @galipremsagar
Fix Series comparison vs scalars (#12519) @brandon-b-miller
Allow casting from UDFString back to StringView to call methods in strings_udf (#12363) @brandon-b-miller

📖 Documentation

Fix GroupBy.apply doc examples rendering (#12994) @brandon-b-miller
add sphinx building and s3 uploading for dask-cudf docs (#12982) @quasiben
Add developer documentation forbidding default parameters in detail APIs (#12978) @vyasr
Add README symlink for dask-cudf. (#12946) @bdice
Remove return type from @return doxygen tags (#12908) @davidwendt
Fix docs build to be pydata-sphinx-theme=0.13.0 compatible (#12874) @galipremsagar
Add skeleton API and prose documentation for dask-cudf (#12725) @wence-
Enable doctests for GroupBy methods (#12658) @brandon-b-miller
Add comment about CUB patch for SegmentedSortInt.Bool gtest (#12611) @davidwendt

🚀 New Features

Add JNI method for strings::replace multi variety (#12979) @NVnavkumar
Add nunique aggregation support for cudf::segmented_reduce (#12972) @davidwendt
Refactor orc chunked writer (#12949) @ttnghia
Make Parquet writer nullable option applicable to single table writes (#12933) @vuule
Refactor io::orc::ProtobufWriter (#12877) @ttnghia
Make timezone table independent from ORC (#12805) @vuule
Cache JIT GroupBy.apply functions (#12802) @brandon-b-miller
Implement initial support for avro logical types (#6482) (#12788) @tpn
Update tests/column_utilities to use experimental::equality row comparator (#12777) @divyegala
Update distinct/unique_count to experimental::row hasher/comparator (#12776) @divyegala
Update hash_partition to use experimental::row::row_hasher (#12761) @divyegala
Update is_sorted to use experimental::row::lexicographic (#12752) @divyegala
Update default data source in cuio reader benchmarks (#12740) @PointKernel
Reenable stream identification library in CI (#12714) @vyasr
Add regex_program strings splitting java APIs and tests (#12713) @cindyyuanjiang
Add regex_program strings replacing java APIs and tests (#12701) @cindyyuanjiang
Add regex_program strings extract java APIs and tests (#12699) @cindyyuanjiang
Variable fragment sizes for Parquet writer (#12685) @etseidl
Add segmented reduction support for fixed-point types (#12680) @davidwendt
Move strings_udf code into cuDF (#12669) @brandon-b-miller
Add regex_program searching APIs and related java classes (#12666) @cindyyuanjiang
Add logging to libcudf (#12637) @vuule
Add compound aggregations to cudf::segmented_reduce (#12573) @davidwendt
Convert rank to use experimental row comparators (#12481) @divyegala
Use rapids-cmake parallel testing feature (#12451) @robertmaynard
Enable detection of undesired stream usage (#12089) @vyasr

🛠️ Improvements

Pin dask and distributed for release (#13070) @galipremsagar
Pin cupy in wheel tests to supported versions (#13041) @vyasr
Pin numba version (#13001) @vyasr
Rework gtests SequenceTest to remove using namespace cudf (#12985) @davidwendt
Stop setting package version attribute in wheels (#12977) @vyasr
Move detail reduction functions to cudf::reduction::detail namespace (#12971) @davidwendt
Remove default detail mrs: part7 (#12970) @vyasr
Remove default detail mrs: part6 (#12969) @vyasr
Remove default detail mrs: part5 (#12968) @vyasr
Remove default detail mrs: part4 (#12967) @vyasr
Remove default detail mrs: part3 (#12966) @vyasr
Remove default detail mrs: part2 (#12965) @vyasr
Remove default detail mrs: part1 (#12964) @vyasr
Add force_nullable_schema parameter to Parquet writer. (#12952) @galipremsagar
Declare a different name for nan_equality.UNEQUAL to prevent Cython warnings. (#12947) @bdice
Remove remaining default stream parameters (#12943) @vyasr
Fix cudf::segmented_reduce gtest for ANY aggregation (#12940) @davidwendt
Implement groupby.head and groupby.tail (#12939) @wence-
Fix libcudf gtests to pass null-count=0 for empty validity masks (#12923) @davidwendt
Migrate parquet encoding to use experimental row operators (#12918) @PointKernel
Fix benchmarks coded in namespace cudf and using namespace cudf (#12915) @karthikeyann
Fix io/text gtests coded in namespace cudf::test (#12914) @karthikeyann
Pass SCCACHE_S3_USE_SSL to conda builds (#12910) @ajschmidt8
Fix FST, JSON gtests & benchmarks coded in namespace cudf::test (#12907) @karthikeyann
Generate pyproject dependencies using dfg (#12906) @vyasr
Update libcudf counting functions to specify cudf::size_type (#12904) @davidwendt
Fix moto env vars & pass AWS_SESSION_TOKEN to conda builds (#12902) @ajschmidt8
Rewrite CSV writer benchmark with nvbench (#12901) @PointKernel
Rework some code logic to reduce iterator and comparator inlining to improve compile time (#12900) @davidwendt
Deprecate line_terminator in favor of lineterminator in to_csv (#12896) @wence-
Add stream and mr parameters for structs::detail::flatten_nested_columns (#12892) @ttnghia
Deprecate libcudf regex APIs accepting pattern strings directly (#12891) @davidwendt
Remove default parameters from detail headers in include (#12888) @vyasr
Update minimum pandas and numpy pinnings (#12887) @galipremsagar
Implement groupby.sample (#12882) @wence-
Update JNI build ENV default to gcc 11 (#12881) @pxLi
Change return type of cudf::structs::detail::flatten_nested_columns to smart pointer (#12878) @ttnghia
Fix passing seed parameter to MurmurHash3_32 in cudf::hash() function (#12875) @davidwendt
Remove manual artifact upload step in CI (#12869) @ajschmidt8
Update to GCC 11 (#12868) @bdice
Fix null hive-partition behavior in dask-cudf parquet (#12866) @rjzamora
Update to protobuf>=4.21.6,<4.22. (#12864) @bdice
Update RMM allocators (#12861) @pentschev
Improve performance for replace-multi for long strings (#12858) @davidwendt
Drop Python 3.7 handling for pickle protocol 4 (#12857) @jakirkham
Migrate as much as possible to pyproject.toml (#12850) @vyasr
Enable nbqa pre-commit hooks for isort and black. (#12848) @bdice
Setting a threshold for KvikIO IO (#12841) @madsbk
Update datasets download URL (#12840) @jjacobelli
Make docs builds less verbose (#12836) @AyodeAwe
Consolidate linter configs into pyproject.toml (#12834) @vyasr
Deprecate names & dtype in Index.copy (#12825) @galipremsagar
Deprecate inplace parameters in categorical methods (#12824) @galipremsagar
Add optional text file support to ninja-log utility (#12823) @davidwendt
Deprecate Index.is_* methods (#12820) @galipremsagar
Add dfg as a pre-commit hook (#12819) @vyasr
Deprecate datetime_is_numeric from describe (#12818) @galipremsagar
Deprecate na_sentinel in factorize (#12817) @galipremsagar
Shuffling read into a sub function in parquet read (#12809) @hyperbolic2346
Fixing parquet coalescing of reads (#12808) @hyperbolic2346
CI: Remove specification of manual stage for check_style.sh script. (#12803) @csadorf
Add compute-sanitizer github workflow action to nightly tests (#12800) @davidwendt
Enable groupby std and variance aggregation types in libcudf Debug build (#12799) @davidwendt
Expose seed argument to hash_values (#12795) @ayushdg
Fix groupby gtests coded in namespace cudf::test (#12784) @davidwendt
Improve performance for cudf::strings::count_characters for long strings (#12779) @davidwendt
Deallocate encoded data in ORC writer immediately after compression (#12770) @vuule
Stop force pulling fmt in nvbench. (#12768) @vyasr
Remove now redundant cuda initialization (#12758) @vyasr
Adds JSON reader, writer io benchmark (#12753) @karthikeyann
Use test paths relative to package directory. (#12751) @bdice
Add build metrics report as artifact to cpp-build workflow (#12750) @davidwendt
Add JNI methods for detecting and purging non-empty nulls from LIST and STRUCT (#12742) @razajafri
Stop using versioneer to manage versions (#12741) @vyasr
Reduce error handling verbosity in CI tests scripts (#12738) @AjayThorve
Reduce the number of test cases in multibyte_split benchmark (#12737) @PointKernel
Update shared workflow branches (#12733) @ajschmidt8
JNI switches to nested JSON reader (#12732) @res-life
Changing cudf::io::source_info to use cudf::host_span<std::byte> in a non-breaking form (#12730) @hyperbolic2346
Add nvbench environment class for initializing RMM in benchmarks (#12728) @davidwendt
Split C++ and Python build dependencies into separate lists. (#12724) @bdice
Add build dependencies to Java tests. (#12723) @bdice
Allow setting the seed argument for hash partition (#12715) @firestarman
Remove gpuCI scripts. (#12712) @bdice
Unpin dask and distributed for development (#12710) @galipremsagar
partition_by_hash(): use _split() (#12704) @madsbk
Remove DataFrame.quantiles from docs. (#12684) @bdice
Fast path for experimental::row::equality (#12676) @divyegala
Move date to build string in conda recipe (#12661) @ajschmidt8
Refactor reduction logic for fixed-point types (#12652) @davidwendt
Pay off some JNI RMM API tech debt (#12632) @revans2
Merge copy-on-write feature branch into branch-23.04 (#12619) @galipremsagar
Remove cudf::strings::repeat_strings_output_sizes and optional parameter from cudf::strings::repeat_strings (#12609) @davidwendt
Pin cuda-nvrtc. (#12606) @bdice
Remove cudf::test::print calls from libcudf gtests (#12604) @davidwendt
Init JNI version 23.04.0-SNAPSHOT (#12599) @pxLi
Add performance benchmarks to user facing docs (#12595) @galipremsagar
Add docs build job (#12592) @AyodeAwe
Replace message parsing with throwing more specific exceptions (#12426) @vyasr
Support conversion to/from cudf in dask.dataframe.core.to_backend (#12380) @rjzamora

Published by raydouglass about 3 years ago

https://github.com/rapidsai/cudf - [NIGHTLY] v23.02.00

🚨 Breaking Changes

Pin dask and distributed for release (#12695) @galipremsagar
Change ways to access ptr in Buffer (#12587) @galipremsagar
Remove column names (#12578) @vuule
Default cudf::io::read_json to nested JSON parser (#12544) @vuule
Switch engine=cudf to the new JSON reader (#12509) @galipremsagar
Add trailing comma support for nested JSON reader (#12448) @karthikeyann
Upgrade to arrow-10.0.1 (#12327) @galipremsagar
Fail loudly to avoid data corruption with unsupported input in read_orc (#12325) @vuule
CSV, JSON reader to infer integer column with nulls as int64 instead of float64 (#12309) @karthikeyann
Remove deprecated code for 23.02 (#12281) @vyasr
Null element for parsing error in numeric types in JSON, CSV reader (#12272) @karthikeyann
Purge non-empty nulls for superimpose_nulls and push_down_nulls (#12239) @ttnghia
Rename cudf::structs::detail::superimpose_parent_nulls APIs (#12230) @ttnghia
Remove JIT type names, refactor id_to_type. (#12158) @bdice
Floor division uses integer division for integral arguments (#12131) @wence-

🐛 Bug Fixes

Fix update-version.sh (#12745) @raydouglass
Fix a mask data corruption in UDF (#12647) @galipremsagar
pre-commit: Update isort version to 5.12.0 (#12645) @wence-
tests: Skip cuInit tests if cuda-gdb is not found or not working (#12644) @wence-
Revert regex program java APIs and tests (#12639) @cindyyuanjiang
Fix leaks in ColumnVectorTest (#12625) @jlowe
Handle when spillable buffers own each other (#12607) @madsbk
Fix incorrect null counts for sliced columns in JCudfSerialization (#12589) @jlowe
lists: Transfer dtypes correctly through list.get (#12586) @wence-
timedelta: Don't go via float intermediates for floordiv (#12585) @wence-
Fixing BUG, get_next_chunk() should use the blocking function device_read() (#12584) @madsbk
Make JNI QuoteStyle accessible outside ai.rapids.cudf (#12572) @mythrocks
partition_by_hash(): support index (#12554) @madsbk
Mixed Join benchmark bug due to wrong conditional column (#12553) @divyegala
Update List Lexicographical Comparator (#12538) @divyegala
Dynamically read PTX version (#12534) @brandon-b-miller
build.sh switch to use RAPIDS magic value (#12525) @robertmaynard
Loosen runtime arrow pinning (#12522) @vyasr
Enable metadata transfer for complex types in transpose (#12491) @galipremsagar
Fix issues with parquet chunked reader (#12488) @nvdbaranec
Fix missing metadata transfer in concat for ListColumn (#12487) @galipremsagar
Rename libcudf substring source files to slice (#12484) @davidwendt
Fix compile issue with arrow 10 (#12465) @ttnghia
Fix List offsets bug in mixed type list column in nested JSON reader (#12447) @karthikeyann
Fix xfail incompatibilities (#12423) @vyasr
Fix bug in Parquet column index encoding (#12404) @etseidl
When building Arrow shared look for a shared OpenSSL (#12396) @robertmaynard
Fix get_json_object to return empty column on empty input (#12384) @davidwendt
Pin arrow 9 in testing dependencies to prevent conda solve issues (#12377) @vyasr
Fix reductions any/all return value for empty input (#12374) @davidwendt
Fix debug compile errors in parquet.hpp (#12372) @davidwendt
Purge non-empty nulls in cudf::make_lists_column (#12370) @ttnghia
Use correct memory resource in io::make_column (#12364) @vyasr
Add code to detect possible malformed page data in parquet files. (#12360) @nvdbaranec
Fail loudly to avoid data corruption with unsupported input in read_orc (#12325) @vuule
Fix NumericPairIteratorTest for float values (#12306) @davidwendt
Fixes memory allocation in nested JSON tokenizer (#12300) @elstehle
Reconstruct dtypes correctly for list aggs of struct columns (#12290) @wence-
Fix regex \A and \Z to strictly match string begin/end (#12282) @davidwendt
Fix compile issue in json_chunked_reader.cpp (#12280) @ttnghia
Change reductions any/all to return valid values for empty input (#12279) @davidwendt
Only exclude join keys that are indices from key columns (#12271) @wence-
Fix spill to device limit (#12252) @madsbk
Correct behaviour of sort in concat for singleton concatenations (#12247) @wence-
Purge non-empty nulls for superimpose_nulls and push_down_nulls (#12239) @ttnghia
Patch CUB DeviceSegmentedSort and remove workaround (#12234) @davidwendt
Fix memory leak in udf_string::assign(&&) function (#12206) @davidwendt
Workaround thrust-copy-if limit in json get_tree_representation (#12190) @davidwendt
Fix page size calculation in Parquet writer (#12182) @etseidl
Add cudf::detail::sizes_to_offsets_iterator to allow checking overflow in offsets (#12180) @davidwendt
Workaround thrust-copy-if limit in wordpiece-tokenizer (#12168) @davidwendt
Floor division uses integer division for integral arguments (#12131) @wence-

📖 Documentation

Fix link to NVTX (#12598) @sameerz
Include missing groupby functions in documentation (#12580) @quasiben
Fix documentation author (#12527) @bdice
Update libcudf reduction docs for casting output types (#12526) @davidwendt
Add JSON reader page in user guide (#12499) @GregoryKimball
Link unsupported iteration API docstrings (#12482) @galipremsagar
strings_udf doc update (#12469) @brandon-b-miller
Update cudf_assert docs with correct NDEBUG behavior (#12464) @robertmaynard
Update pre-commit hooks guide (#12395) @bdice
Update test docs to not use detail comparison utilities (#12332) @PointKernel
Fix doxygen description for regex_program::compute_working_memory_size (#12329) @davidwendt
Add eval to docs. (#12322) @vyasr
Turn on xfail_strict=true (#12244) @wence-
Update 10 minutes to cuDF (#12114) @wence-

🚀 New Features

Use kvikIO as the default IO backend (#12574) @vuule
Use has_nonempty_nulls instead of may_contain_non_empty_nulls in superimpose_nulls and push_down_nulls (#12560) @ttnghia
Add strings methods removeprefix and removesuffix (#12557) @davidwendt
Add regex_program java APIs and unit tests (#12548) @cindyyuanjiang
Default cudf::io::read_json to nested JSON parser (#12544) @vuule
Make string quoting optional on CSV write (#12539) @mythrocks
Use new nvCOMP API to optimize the compression temp memory size (#12533) @vuule
Support "values" orient (array of arrays) in Nested JSON reader (#12498) @karthikeyann
one_hot_encode to use experimental row comparators (#12478) @divyegala
Support %W and %w format specifiers in cudf::strings::to_timestamps (#12475) @davidwendt
Add JSON Writer (#12474) @karthikeyann
Refactor thrust_copy_if into cudf::detail::copy_if_safe (#12455) @ttnghia
Add trailing comma support for nested JSON reader (#12448) @karthikeyann
Extract tokenize_json.hpp detail header from src/io/json/nested_json.hpp (#12432) @ttnghia
JNI bindings to write CSV (#12425) @mythrocks
Nested JSON depth benchmark (#12371) @karthikeyann
Implement lists::reverse (#12336) @ttnghia
Use device_read in experimental read_json (#12314) @vuule
Implement JNI for strings::reverse (#12283) @ttnghia
Null element for parsing error in numeric types in JSON, CSV reader (#12272) @karthikeyann
Add cudf::strings::like function with multiple patterns (#12269) @davidwendt
Add environment variable to control host memory allocation in hostdevice_vector (#12251) @vuule
Add cudf::strings::reverse function (#12227) @davidwendt
Selectively use dictionary encoding in Parquet writer (#12211) @etseidl
Support replace in strings_udf (#12207) @brandon-b-miller
Add support to read binary encoded decimals in parquet (#12205) @PointKernel
Support regex EOL where the string ends with a new-line character (#12181) @davidwendt
Updating stream_compaction/unique to use new row comparators (#12159) @divyegala
Add device buffer datasource (#12024) @PointKernel
Implement groupby apply with JIT (#11452) @bwyogatama

🛠️ Improvements

Update shared workflow branches (#12696) @ajschmidt8
Pin dask and distributed for release (#12695) @galipremsagar
Don't upload libcudf-example to Anaconda.org (#12671) @ajschmidt8
Pin wheel dependencies to same RAPIDS release (#12659) @sevagh
Use CTK 118/cp310 branch of wheel workflows (#12602) @sevagh
Change ways to access ptr in Buffer (#12587) @galipremsagar
Version a parquet writer xfail (#12579) @galipremsagar
Remove column names (#12578) @vuule
Parquet reader optimization to address V100 regression. (#12577) @nvdbaranec
Add support for category dtypes in CSV reader (#12571) @galipremsagar
Remove spill_lock parameter from SpillableBuffer.get_ptr() (#12564) @madsbk
Optimize cudf::make_lists_column (#12547) @ttnghia
Remove cudf::strings::repeat_strings_output_sizes from Java and JNI (#12546) @ttnghia
Test that cuInit is not called when RAPIDS_NO_INITIALIZE is set (#12545) @wence-
Rework repeat_strings to use sizes-to-offsets utility (#12543) @davidwendt
Replace exclusive_scan with sizes_to_offsets in cudf::lists::sequences (#12541) @davidwendt
Rework nvtext::ngrams_tokenize to use sizes-to-offsets utility (#12540) @davidwendt
Fix binary-ops gtests coded in namespace cudf::test (#12536) @davidwendt
More @acquire_spill_lock() and as_buffer(..., exposed=False) (#12535) @madsbk
Guard CUDA runtime APIs with error checking (#12531) @PointKernel
Update TODOs from issue 10432. (#12528) @bdice
Update rapids-cmake definitions version in GitHub Actions style checks. (#12511) @bdice
Switch engine=cudf to the new JSON reader (#12509) @galipremsagar
Fix SUM/MEAN aggregation type support. (#12503) @bdice
Stop using pandas._testing (#12492) @vyasr
Fix ROLLING_TEST gtests coded in namespace cudf::test (#12490) @davidwendt
Fix erroneously skipped ORC ZSTD test (#12486) @vuule
Rework nvtext::generate_character_ngrams to use make_strings_children (#12480) @davidwendt
Raise warnings as errors in the test suite (#12468) @vyasr
Remove int32 hard-coding in python (#12467) @galipremsagar
Use cudaMemcpyDefault. (#12466) @bdice
Update workflows for nightly tests (#12462) @ajschmidt8
Build CUDA 11.8 and Python 3.10 Packages (#12457) @ajschmidt8
JNI build image default as cuda11.8 (#12441) @pxLi
Re-enable Recently Updated Check (#12435) @ajschmidt8
Rework remaining cudf::strings::from_xyz functions to use make_strings_children (#12434) @vuule
Build wheels alongside conda CI (#12427) @sevagh
Remove arguments for checking exception messages in Python (#12424) @vyasr
Clean up cuco usage (#12421) @PointKernel
Fix warnings in remaining modules (#12406) @vyasr
Update ops-bot.yaml (#12402) @ajschmidt8
Rework cudf::strings::integers_to_ipv4 to use make_strings_children utility (#12401) @davidwendt
Use numpy.empty() instead of bytearray to allocate host memory for spilling (#12399) @madsbk
Deprecate chunksize from dask_cudf.read_csv (#12394) @rjzamora
Expose the RMM pool size in JNI (#12390) @revans2
Fix COPYING_TEST: gtests coded in namespace cudf::test (#12387) @davidwendt
Rework cudf::strings::url_encode to use make_strings_children utility (#12385) @davidwendt
Use make_strings_children in parse_data nested json reader (#12382) @karthikeyann
Fix warnings in test_datetime.py (#12381) @vyasr
Mixed Join Benchmarks (#12375) @divyegala
Fix warnings in dataframe.py (#12369) @vyasr
Update conda recipes. (#12368) @bdice
Use gpu-latest-1 runner tag (#12366) @bdice
Rework cudf::strings::from_booleans to use make_strings_children (#12365) @vuule
Fix warnings in test modules up to test_dataframe.py (#12355) @vyasr
JSON column performance optimization - struct column nulls (#12354) @karthikeyann
Accelerate stable-segmented-sort with CUB segmented sort (#12347) @davidwendt
Add size check to make_offsets_child_column utility (#12345) @davidwendt
Enable max compression ratio small block optimization for ZSTD (#12338) @vuule
Fix warnings in test_monotonic.py (#12334) @vyasr
Improve JSON column creation performance (list offsets) (#12330) @karthikeyann
Upgrade to arrow-10.0.1 (#12327) @galipremsagar
Fix warnings in test_orc.py (#12326) @vyasr
Fix warnings in test_groupby.py (#12324) @vyasr
Fix test_notebooks.sh (#12323) @ajschmidt8
Fix transform gtests coded in namespace cudf::test (#12321) @davidwendt
Fix check_style.sh script (#12320) @ajschmidt8
Rework cudf::strings::from_timestamps to use make_strings_children (#12317) @davidwendt
Fix warnings in test_index.py (#12313) @vyasr
Fix warnings in test_multiindex.py (#12310) @vyasr
CSV, JSON reader to infer integer column with nulls as int64 instead of float64 (#12309) @karthikeyann
Fix warnings in test_indexing.py (#12305) @vyasr
Fix warnings in test_joining.py (#12304) @vyasr
Unpin dask and distributed for development (#12302) @galipremsagar
Re-enable sccache for Jenkins builds (#12297) @ajschmidt8
Define needs for pr-builder workflow. (#12296) @bdice
Forward merge 22.12 into 23.02 (#12294) @vyasr
Fix warnings in test_stats.py (#12293) @vyasr
Fix table gtests coded in namespace cudf::test (#12292) @davidwendt
Change cython for regex calls to use cudf::strings::regex_program (#12289) @davidwendt
Improved error reporting when reading multiple JSON files (#12285) @vuule
Deprecate Frame.sum_of_squares (#12284) @vyasr
Remove deprecated code for 23.02 (#12281) @vyasr
Clean up handling of max_page_size_bytes in Parquet writer (#12277) @etseidl
Fix replace gtests coded in namespace cudf::test (#12270) @davidwendt
Add pandas nullable type support in Index.to_pandas (#12268) @galipremsagar
Rework nvtext::detokenize to use indexalator for row indices (#12267) @davidwendt
Fix reduction gtests coded in namespace cudf::test (#12257) @davidwendt
Remove default parameters from cudf::detail::sort function declarations (#12254) @davidwendt
Add duplicated support for Series, DataFrame and Index (#12246) @galipremsagar
Replace column/table test utilities with macros (#12242) @PointKernel
Rework cudf::strings::pad and zfill to use make_strings_children (#12238) @davidwendt
Fix sort gtests coded in namespace cudf::test (#12237) @davidwendt
Wrapping concat and file writes in @acquire_spill_lock() (#12232) @madsbk
Rename cudf::structs::detail::superimpose_parent_nulls APIs (#12230) @ttnghia
Cover parsing to decimal types in read_json tests (#12229) @vuule
Spill Statistics (#12223) @madsbk
Use CUDF_JNI_ENABLE_PROFILING to conditionally enable profiling support. (#12221) @bdice
Clean up of test_spilling.py (#12220) @madsbk
Simplify repetitive boolean logic (#12218) @vuule
Add Series.hasnans and Index.hasnans (#12214) @galipremsagar
Add cudf::strings::udf::replace function (#12210) @davidwendt
Adds in new java APIs for appending byte arrays to host columnar data (#12208) @revans2
Remove Python dependencies from Java CI. (#12193) @bdice
Fix null order in sort-based groupby and improve groupby tests (#12191) @divyegala
Move strings children functions from cudf/strings/detail/utilities.cuh to new header (#12185) @davidwendt
Clean up existing JNI scalar to column code (#12173) @revans2
Remove JIT type names, refactor id_to_type. (#12158) @bdice
Update JNI version to 23.02.0-SNAPSHOT (#12129) @pxLi
Minor refactor of cpp/src/io/parquet/page_data.cu (#12126) @etseidl
Add codespell as a linter (#12097) @benfred
Enable specifying exceptions in error macros (#12078) @vyasr
Move _label_encoding from Series to Column (#12040) @shwina
Add GitHub Actions Workflows (#12002) @ajschmidt8
Consolidate dask-cudf groupby_agg calls in one place (#10835) @charlesbluca

- C++
Published by rapids-bot[bot] over 3 years ago

https://github.com/rapidsai/cudf - v23.02.00

🚨 Breaking Changes

Pin dask and distributed for release (#12695) @galipremsagar
Change ways to access ptr in Buffer (#12587) @galipremsagar
Remove column names (#12578) @vuule
Default cudf::io::read_json to nested JSON parser (#12544) @vuule
Switch engine=cudf to the new JSON reader (#12509) @galipremsagar
Add trailing comma support for nested JSON reader (#12448) @karthikeyann
Upgrade to arrow-10.0.1 (#12327) @galipremsagar
Fail loudly to avoid data corruption with unsupported input in read_orc (#12325) @vuule
CSV, JSON reader to infer integer column with nulls as int64 instead of float64 (#12309) @karthikeyann
Remove deprecated code for 23.02 (#12281) @vyasr
Null element for parsing error in numeric types in JSON, CSV reader (#12272) @karthikeyann
Purge non-empty nulls for superimpose_nulls and push_down_nulls (#12239) @ttnghia
Rename cudf::structs::detail::superimpose_parent_nulls APIs (#12230) @ttnghia
Remove JIT type names, refactor id_to_type. (#12158) @bdice
Floor division uses integer division for integral arguments (#12131) @wence-

🐛 Bug Fixes

Fix a mask data corruption in UDF (#12647) @galipremsagar
pre-commit: Update isort version to 5.12.0 (#12645) @wence-
tests: Skip cuInit tests if cuda-gdb is not found or not working (#12644) @wence-
Revert regex program java APIs and tests (#12639) @cindyyuanjiang
Fix leaks in ColumnVectorTest (#12625) @jlowe
Handle when spillable buffers own each other (#12607) @madsbk
Fix incorrect null counts for sliced columns in JCudfSerialization (#12589) @jlowe
lists: Transfer dtypes correctly through list.get (#12586) @wence-
timedelta: Don't go via float intermediates for floordiv (#12585) @wence-
Fixing BUG, get_next_chunk() should use the blocking function device_read() (#12584) @madsbk
Make JNI QuoteStyle accessible outside ai.rapids.cudf (#12572) @mythrocks
partition_by_hash(): support index (#12554) @madsbk
Mixed Join benchmark bug due to wrong conditional column (#12553) @divyegala
Update List Lexicographical Comparator (#12538) @divyegala
Dynamically read PTX version (#12534) @brandon-b-miller
build.sh switch to use RAPIDS magic value (#12525) @robertmaynard
Loosen runtime arrow pinning (#12522) @vyasr
Enable metadata transfer for complex types in transpose (#12491) @galipremsagar
Fix issues with parquet chunked reader (#12488) @nvdbaranec
Fix missing metadata transfer in concat for ListColumn (#12487) @galipremsagar
Rename libcudf substring source files to slice (#12484) @davidwendt
Fix compile issue with arrow 10 (#12465) @ttnghia
Fix List offsets bug in mixed type list column in nested JSON reader (#12447) @karthikeyann
Fix xfail incompatibilities (#12423) @vyasr
Fix bug in Parquet column index encoding (#12404) @etseidl
When building Arrow shared look for a shared OpenSSL (#12396) @robertmaynard
Fix get_json_object to return empty column on empty input (#12384) @davidwendt
Pin arrow 9 in testing dependencies to prevent conda solve issues (#12377) @vyasr
Fix reductions any/all return value for empty input (#12374) @davidwendt
Fix debug compile errors in parquet.hpp (#12372) @davidwendt
Purge non-empty nulls in cudf::make_lists_column (#12370) @ttnghia
Use correct memory resource in io::make_column (#12364) @vyasr
Add code to detect possible malformed page data in parquet files. (#12360) @nvdbaranec
Fail loudly to avoid data corruption with unsupported input in read_orc (#12325) @vuule
Fix NumericPairIteratorTest for float values (#12306) @davidwendt
Fixes memory allocation in nested JSON tokenizer (#12300) @elstehle
Reconstruct dtypes correctly for list aggs of struct columns (#12290) @wence-
Fix regex \A and \Z to strictly match string begin/end (#12282) @davidwendt
Fix compile issue in json_chunked_reader.cpp (#12280) @ttnghia
Change reductions any/all to return valid values for empty input (#12279) @davidwendt
Only exclude join keys that are indices from key columns (#12271) @wence-
Fix spill to device limit (#12252) @madsbk
Correct behaviour of sort in concat for singleton concatenations (#12247) @wence-
Purge non-empty nulls for superimpose_nulls and push_down_nulls (#12239) @ttnghia
Patch CUB DeviceSegmentedSort and remove workaround (#12234) @davidwendt
Fix memory leak in udf_string::assign(&&) function (#12206) @davidwendt
Workaround thrust-copy-if limit in json get_tree_representation (#12190) @davidwendt
Fix page size calculation in Parquet writer (#12182) @etseidl
Add cudf::detail::sizes_to_offsets_iterator to allow checking overflow in offsets (#12180) @davidwendt
Workaround thrust-copy-if limit in wordpiece-tokenizer (#12168) @davidwendt
Floor division uses integer division for integral arguments (#12131) @wence-

📖 Documentation

Fix link to NVTX (#12598) @sameerz
Include missing groupby functions in documentation (#12580) @quasiben
Fix documentation author (#12527) @bdice
Update libcudf reduction docs for casting output types (#12526) @davidwendt
Add JSON reader page in user guide (#12499) @GregoryKimball
Link unsupported iteration API docstrings (#12482) @galipremsagar
strings_udf doc update (#12469) @brandon-b-miller
Update cudf_assert docs with correct NDEBUG behavior (#12464) @robertmaynard
Update pre-commit hooks guide (#12395) @bdice
Update test docs to not use detail comparison utilities (#12332) @PointKernel
Fix doxygen description for regex_program::compute_working_memory_size (#12329) @davidwendt
Add eval to docs. (#12322) @vyasr
Turn on xfail_strict=true (#12244) @wence-
Update 10 minutes to cuDF (#12114) @wence-

🚀 New Features

Use kvikIO as the default IO backend (#12574) @vuule
Use has_nonempty_nulls instead of may_contain_non_empty_nulls in superimpose_nulls and push_down_nulls (#12560) @ttnghia
Add strings methods removeprefix and removesuffix (#12557) @davidwendt
Add regex_program java APIs and unit tests (#12548) @cindyyuanjiang
Default cudf::io::read_json to nested JSON parser (#12544) @vuule
Make string quoting optional on CSV write (#12539) @mythrocks
Use new nvCOMP API to optimize the compression temp memory size (#12533) @vuule
Support "values" orient (array of arrays) in Nested JSON reader (#12498) @karthikeyann
one_hot_encode to use experimental row comparators (#12478) @divyegala
Support %W and %w format specifiers in cudf::strings::to_timestamps (#12475) @davidwendt
Add JSON Writer (#12474) @karthikeyann
Refactor thrust_copy_if into cudf::detail::copy_if_safe (#12455) @ttnghia
Add trailing comma support for nested JSON reader (#12448) @karthikeyann
Extract tokenize_json.hpp detail header from src/io/json/nested_json.hpp (#12432) @ttnghia
JNI bindings to write CSV (#12425) @mythrocks
Nested JSON depth benchmark (#12371) @karthikeyann
Implement lists::reverse (#12336) @ttnghia
Use device_read in experimental read_json (#12314) @vuule
Implement JNI for strings::reverse (#12283) @ttnghia
Null element for parsing error in numeric types in JSON, CSV reader (#12272) @karthikeyann
Add cudf::strings::like function with multiple patterns (#12269) @davidwendt
Add environment variable to control host memory allocation in hostdevice_vector (#12251) @vuule
Add cudf::strings::reverse function (#12227) @davidwendt
Selectively use dictionary encoding in Parquet writer (#12211) @etseidl
Support replace in strings_udf (#12207) @brandon-b-miller
Add support to read binary encoded decimals in parquet (#12205) @PointKernel
Support regex EOL where the string ends with a new-line character (#12181) @davidwendt
Updating stream_compaction/unique to use new row comparators (#12159) @divyegala
Add device buffer datasource (#12024) @PointKernel
Implement groupby apply with JIT (#11452) @bwyogatama

🛠️ Improvements

Update shared workflow branches (#12696) @ajschmidt8
Pin dask and distributed for release (#12695) @galipremsagar
Don't upload libcudf-example to Anaconda.org (#12671) @ajschmidt8
Pin wheel dependencies to same RAPIDS release (#12659) @sevagh
Use CTK 118/cp310 branch of wheel workflows (#12602) @sevagh
Change ways to access ptr in Buffer (#12587) @galipremsagar
Version a parquet writer xfail (#12579) @galipremsagar
Remove column names (#12578) @vuule
Parquet reader optimization to address V100 regression. (#12577) @nvdbaranec
Add support for category dtypes in CSV reader (#12571) @galipremsagar
Remove spill_lock parameter from SpillableBuffer.get_ptr() (#12564) @madsbk
Optimize cudf::make_lists_column (#12547) @ttnghia
Remove cudf::strings::repeat_strings_output_sizes from Java and JNI (#12546) @ttnghia
Test that cuInit is not called when RAPIDS_NO_INITIALIZE is set (#12545) @wence-
Rework repeat_strings to use sizes-to-offsets utility (#12543) @davidwendt
Replace exclusive_scan with sizes_to_offsets in cudf::lists::sequences (#12541) @davidwendt
Rework nvtext::ngrams_tokenize to use sizes-to-offsets utility (#12540) @davidwendt
Fix binary-ops gtests coded in namespace cudf::test (#12536) @davidwendt
More @acquire_spill_lock() and as_buffer(..., exposed=False) (#12535) @madsbk
Guard CUDA runtime APIs with error checking (#12531) @PointKernel
Update TODOs from issue 10432. (#12528) @bdice
Update rapids-cmake definitions version in GitHub Actions style checks. (#12511) @bdice
Switch engine=cudf to the new JSON reader (#12509) @galipremsagar
Fix SUM/MEAN aggregation type support. (#12503) @bdice
Stop using pandas._testing (#12492) @vyasr
Fix ROLLING_TEST gtests coded in namespace cudf::test (#12490) @davidwendt
Fix erroneously skipped ORC ZSTD test (#12486) @vuule
Rework nvtext::generate_character_ngrams to use make_strings_children (#12480) @davidwendt
Raise warnings as errors in the test suite (#12468) @vyasr
Remove int32 hard-coding in python (#12467) @galipremsagar
Use cudaMemcpyDefault. (#12466) @bdice
Update workflows for nightly tests (#12462) @ajschmidt8
Build CUDA 11.8 and Python 3.10 Packages (#12457) @ajschmidt8
JNI build image default as cuda11.8 (#12441) @pxLi
Re-enable Recently Updated Check (#12435) @ajschmidt8
Rework remaining cudf::strings::from_xyz functions to use make_strings_children (#12434) @vuule
Build wheels alongside conda CI (#12427) @sevagh
Remove arguments for checking exception messages in Python (#12424) @vyasr
Clean up cuco usage (#12421) @PointKernel
Fix warnings in remaining modules (#12406) @vyasr
Update ops-bot.yaml (#12402) @ajschmidt8
Rework cudf::strings::integers_to_ipv4 to use make_strings_children utility (#12401) @davidwendt
Use numpy.empty() instead of bytearray to allocate host memory for spilling (#12399) @madsbk
Deprecate chunksize from dask_cudf.read_csv (#12394) @rjzamora
Expose the RMM pool size in JNI (#12390) @revans2
Fix COPYING_TEST: gtests coded in namespace cudf::test (#12387) @davidwendt
Rework cudf::strings::url_encode to use make_strings_children utility (#12385) @davidwendt
Use make_strings_children in parse_data nested json reader (#12382) @karthikeyann
Fix warnings in test_datetime.py (#12381) @vyasr
Mixed Join Benchmarks (#12375) @divyegala
Fix warnings in dataframe.py (#12369) @vyasr
Update conda recipes. (#12368) @bdice
Use gpu-latest-1 runner tag (#12366) @bdice
Rework cudf::strings::from_booleans to use make_strings_children (#12365) @vuule
Fix warnings in test modules up to test_dataframe.py (#12355) @vyasr
JSON column performance optimization - struct column nulls (#12354) @karthikeyann
Accelerate stable-segmented-sort with CUB segmented sort (#12347) @davidwendt
Add size check to make_offsets_child_column utility (#12345) @davidwendt
Enable max compression ratio small block optimization for ZSTD (#12338) @vuule
Fix warnings in test_monotonic.py (#12334) @vyasr
Improve JSON column creation performance (list offsets) (#12330) @karthikeyann
Upgrade to arrow-10.0.1 (#12327) @galipremsagar
Fix warnings in test_orc.py (#12326) @vyasr
Fix warnings in test_groupby.py (#12324) @vyasr
Fix test_notebooks.sh (#12323) @ajschmidt8
Fix transform gtests coded in namespace cudf::test (#12321) @davidwendt
Fix check_style.sh script (#12320) @ajschmidt8
Rework cudf::strings::from_timestamps to use make_strings_children (#12317) @davidwendt
Fix warnings in test_index.py (#12313) @vyasr
Fix warnings in test_multiindex.py (#12310) @vyasr
CSV, JSON reader to infer integer column with nulls as int64 instead of float64 (#12309) @karthikeyann
Fix warnings in test_indexing.py (#12305) @vyasr
Fix warnings in test_joining.py (#12304) @vyasr
Unpin dask and distributed for development (#12302) @galipremsagar
Re-enable sccache for Jenkins builds (#12297) @ajschmidt8
Define needs for pr-builder workflow. (#12296) @bdice
Forward merge 22.12 into 23.02 (#12294) @vyasr
Fix warnings in test_stats.py (#12293) @vyasr
Fix table gtests coded in namespace cudf::test (#12292) @davidwendt
Change cython for regex calls to use cudf::strings::regex_program (#12289) @davidwendt
Improved error reporting when reading multiple JSON files (#12285) @vuule
Deprecate Frame.sum_of_squares (#12284) @vyasr
Remove deprecated code for 23.02 (#12281) @vyasr
Clean up handling of max_page_size_bytes in Parquet writer (#12277) @etseidl
Fix replace gtests coded in namespace cudf::test (#12270) @davidwendt
Add pandas nullable type support in Index.to_pandas (#12268) @galipremsagar
Rework nvtext::detokenize to use indexalator for row indices (#12267) @davidwendt
Fix reduction gtests coded in namespace cudf::test (#12257) @davidwendt
Remove default parameters from cudf::detail::sort function declarations (#12254) @davidwendt
Add duplicated support for Series, DataFrame and Index (#12246) @galipremsagar
Replace column/table test utilities with macros (#12242) @PointKernel
Rework cudf::strings::pad and zfill to use make_strings_children (#12238) @davidwendt
Fix sort gtests coded in namespace cudf::test (#12237) @davidwendt
Wrapping concat and file writes in @acquire_spill_lock() (#12232) @madsbk
Rename cudf::structs::detail::superimpose_parent_nulls APIs (#12230) @ttnghia
Cover parsing to decimal types in read_json tests (#12229) @vuule
Spill Statistics (#12223) @madsbk
Use CUDF_JNI_ENABLE_PROFILING to conditionally enable profiling support. (#12221) @bdice
Clean up of test_spilling.py (#12220) @madsbk
Simplify repetitive boolean logic (#12218) @vuule
Add Series.hasnans and Index.hasnans (#12214) @galipremsagar
Add cudf::strings::udf::replace function (#12210) @davidwendt
Adds in new java APIs for appending byte arrays to host columnar data (#12208) @revans2
Remove Python dependencies from Java CI. (#12193) @bdice
Fix null order in sort-based groupby and improve groupby tests (#12191) @divyegala
Move strings children functions from cudf/strings/detail/utilities.cuh to new header (#12185) @davidwendt
Clean up existing JNI scalar to column code (#12173) @revans2
Remove JIT type names, refactor id_to_type. (#12158) @bdice
Update JNI version to 23.02.0-SNAPSHOT (#12129) @pxLi
Minor refactor of cpp/src/io/parquet/page_data.cu (#12126) @etseidl
Add codespell as a linter (#12097) @benfred
Enable specifying exceptions in error macros (#12078) @vyasr
Move _label_encoding from Series to Column (#12040) @shwina
Add GitHub Actions Workflows (#12002) @ajschmidt8
Consolidate dask-cudf groupby_agg calls in one place (#10835) @charlesbluca

- C++
Published by raydouglass over 3 years ago

https://github.com/rapidsai/cudf - v22.12.01

🚨 Breaking Changes

Add JNI for substring without 'end' parameter. (#12113) @firestarman
Refactor purge_nonempty_nulls (#12111) @ttnghia
Create an int8 column in read_csv when all elements are missing (#12110) @vuule
Throw an error when libcudf is built without cuFile and LIBCUDF_CUFILE_POLICY is set to "ALWAYS" (#12080) @vuule
Fix type promotion edge cases in numerical binops (#12074) @wence-
Reduce/Remove reliance on **kwargs and *args in IO readers & writers (#12025) @galipremsagar
Rollback of DeviceBufferLike (#12009) @madsbk
Remove unused managed_allocator (#12005) @vyasr
Pass column names to write_csv instead of table_metadata pointer (#11972) @vuule
Accept const refs instead of const unique_ptr refs in reduce and scan APIs. (#11960) @vyasr
Default to equal NaNs in make_merge_sets_aggregation. (#11952) @bdice
Remove validation that requires introspection (#11938) @vyasr
Trim quotes for non-string values in nested json parsing (#11898) @karthikeyann
Add tests ensuring that cudf's default stream is always used (#11875) @vyasr
Support nested types as groupby keys in libcudf (#11792) @PointKernel
Default to equal NaNs in make_collect_set_aggregation. (#11621) @bdice
Removing int8 column option from parquet byte_array writing (#11539) @hyperbolic2346
part1: Simplify BaseIndex to an abstract class (#10389) @skirui-source

🐛 Bug Fixes

strings_udf: use libcudf caching of character tables (#12343) @wence-
Fix include line for IO Cython modules (#12250) @vyasr
Make dask pinning looser (#12231) @vyasr
Workaround for CUB segmented-sort bug with boolean keys (#12217) @davidwendt
Fix from_dict backend dispatch to match upstream dask (#12203) @galipremsagar
Merge branch-22.10 into branch-22.12 (#12198) @davidwendt
Fix compression in ORC writer (#12194) @vuule
Don't use CMake 3.25.0 as it has a show stopping FindCUDAToolkit bug (#12188) @robertmaynard
Fix data corruption when reading ORC files with empty stripes (#12160) @vuule
Fix decimal binary operations (#12142) @galipremsagar
Ensure dlpack include is provided to cudf interop lib (#12139) @robertmaynard
Safely allocate udf_string pointers in strings_udf (#12138) @brandon-b-miller
Fix/disable jitify lto (#12122) @robertmaynard
Fix conditional_full_join benchmark (#12121) @GregoryKimball
Fix regex working-memory-size refactor error (#12119) @davidwendt
Add in negative size checks for columns (#12118) @revans2
Add JNI for substring without 'end' parameter. (#12113) @firestarman
Fix reading of CSV files with blank second row (#12098) @vuule
Fix an error in IO with GzipFile type (#12085) @galipremsagar
Workaround groupby aggregate thrust::copy_if overflow (#12079) @davidwendt
Fix alignment of compressed blocks in ORC writer (#12077) @vuule
Fix singleton-range __setitem__ edge case (#12075) @wence-
Fix type promotion edge cases in numerical binops (#12074) @wence-
Force using old fmt in nvbench. (#12067) @vyasr
Fixes List offset bug in Nested JSON reader (#12060) @karthikeyann
Allow falling back to shim_60.ptx by default in strings_udf (#12056) @brandon-b-miller
Force black exclusions for pre-commit. (#12036) @bdice
Add memory_usage & items implementation for Struct column & dtype (#12033) @galipremsagar
Reduce/Remove reliance on **kwargs and *args in IO readers & writers (#12025) @galipremsagar
Fixes bug in csv_reader_options construction in cython (#12021) @karthikeyann
Fix issues when both usecols and names options are used in read_csv (#12018) @vuule
Port thrust's pinned_allocator to cudf, since Thrust 1.17 removes the type (#12004) @robertmaynard
Revert "Replace most of preprocessor usage in nvcomp adapter with constexpr" (#11999) @vuule
Fix bug where df.loc resulting in single row could give wrong index (#11998) @eriknw
Switch to DISABLE_DEPRECATION_WARNINGS to match other RAPIDS projects (#11989) @robertmaynard
Fix maximum page size estimate in Parquet writer (#11962) @vuule
Fix local offset handling in bgzip reader (#11918) @upsj
Fix an issue reading struct-of-list types in Parquet. (#11910) @nvdbaranec
Fix memcheck error in TypeInference.Timestamp gtest (#11905) @davidwendt
Fix type casting in Series.__setitem__ (#11904) @wence-
Fix memcheck error in get_dremel_data (#11903) @davidwendt
Fixes Unsupported column type error due to empty list columns in Nested JSON reader (#11897) @karthikeyann
Fix segmented-sort to ignore indices outside the offsets (#11888) @davidwendt
Fix cudf::stable_sorted_order for NaN and -NaN in FLOAT64 columns (#11874) @davidwendt
Fix writing of Parquet files with many fragments (#11869) @etseidl
Fix RangeIndex unary operators. (#11868) @vyasr
JNI Avoid NPE for reading host binary data (#11865) @revans2
Fix decimal benchmark input data generation (#11863) @karthikeyann
Fix pre-commit copyright check (#11860) @galipremsagar
Fix Parquet support for seconds and milliseconds duration types (#11854) @vuule
Ensure better compiler cache results between cudf cal-ver branches (#11835) @robertmaynard
Fix make_column_from_scalar for all-null strings column (#11807) @davidwendt
Tell jitify_preprocess where to search for libnvrtc (#11787) @robertmaynard
add V2 page header support to parquet reader (#11778) @etseidl
Parquet reader: bug fix for a num_rows/skip_rows corner case, w/optimization for nested preprocessing (#11752) @nvdbaranec
Determine if Arrow has S3 support at runtime in unit test. (#11560) @bdice

📖 Documentation

Use rapidsai CODE_OF_CONDUCT.md (#12166) @bdice
Add symlinks to notebooks. (#12128) @bdice
Add truncate API to python doc pages (#12109) @galipremsagar
Update Numba docs links. (#12107) @bdice
Remove "Multi-GPU with Dask-cuDF" notebook. (#12095) @bdice
Fix link to c++ developer guide from CONTRIBUTING.md (#12084) @brandon-b-miller
Add pivot_table and crosstab to docs. (#12014) @bdice
Fix doxygen text for cudf::dictionary::encode (#11991) @davidwendt
Replace default_stream_value with get_default_stream in docs. (#11985) @vyasr
Add dtype docs pages and docstrings for cudf specific dtypes (#11974) @galipremsagar
Update Unit Testing in libcudf guidelines to code tests outside the cudf::test namespace (#11959) @davidwendt
Rename libcudf++ to libcudf. (#11953) @bdice
Fix documentation referring to removed as_gpu_matrix method. (#11937) @bdice
Remove "experimental" warning for struct columns in ORC reader and writer (#11880) @vuule
Initial draft of policies and guidelines for libcudf usage. (#11853) @vyasr
Add clear indication of non-GPU accelerated parameters in read_json docstring (#11825) @GregoryKimball
Add developer docs for writing tests (#11199) @vyasr

🚀 New Features

Adds an EventHandler to Java MemoryBuffer to be invoked on close (#12125) @abellina
Support + in strings_udf (#12117) @brandon-b-miller
Support upper and lower in strings_udf (#12099) @brandon-b-miller
Add wheel builds (#12096) @vyasr
Allow setting malloc heap size in string udfs (#12094) @brandon-b-miller
Support strip, lstrip, and rstrip in strings_udf (#12091) @brandon-b-miller
Mark nvcomp zstd compression stable (#12059) @jbrennan333
Add debug-only onAllocated/onDeallocated to RmmEventHandler (#12054) @abellina
Enable building against the libarrow contained in pyarrow (#12034) @vyasr
Add strings like jni and native method (#12032) @cindyyuanjiang
Cleanup common parsing code in JSON, CSV reader (#12022) @karthikeyann
byte_range support for JSON Lines format (#12017) @karthikeyann
Minor cleanup of root CMakeLists.txt for better organization (#11988) @robertmaynard
Add inplace arithmetic operators to MaskedType (#11987) @brandon-b-miller
Implement JNI for chunked Parquet reader (#11961) @ttnghia
Add method argument to DataFrame.quantile (#11957) @rjzamora
Add gpu memory watermark apis to JNI (#11950) @abellina
Adds retryCount to RmmEventHandler.onAllocFailure (#11940) @abellina
Enable returning string data from UDFs used through apply (#11933) @brandon-b-miller
Switch over to rapids-cmake patches for thrust (#11921) @robertmaynard
Add strings udf C++ classes and functions for phase II (#11912) @davidwendt
Trim quotes for non-string values in nested json parsing (#11898) @karthikeyann
Enable CEC for strings_udf (#11884) @brandon-b-miller
ArrowIPCTableWriter writes an empty batch in the case of an empty table. (#11883) @firestarman
Implement chunked Parquet reader (#11867) @ttnghia
Add read_orc_metadata to libcudf (#11815) @vuule
Support nested types as groupby keys in libcudf (#11792) @PointKernel
Adding feature Truncate to DataFrame and Series (#11435) @VamsiTallam95

🛠️ Improvements

Reduce number of tests marked spilling (#12197) @madsbk
Pin dask and distributed for release (#12165) @galipremsagar
Don't rely on GNU find in headers_test.sh (#12164) @wence-
Update cp.clip call (#12148) @quasiben
Enable automatic column projection in groupby().agg (#12124) @rjzamora
Refactor purge_nonempty_nulls (#12111) @ttnghia
Create an int8 column in read_csv when all elements are missing (#12110) @vuule
Spilling to host memory (#12106) @madsbk
First pass of pd.read_orc changes in tests (#12103) @galipremsagar
Expose engine argument in dask_cudf.read_json (#12101) @rjzamora
Remove CUDA 10 compatibility code. (#12088) @bdice
Move and update dask nightly install in CI (#12082) @galipremsagar
Throw an error when libcudf is built without cuFile and LIBCUDF_CUFILE_POLICY is set to "ALWAYS" (#12080) @vuule
Remove macros that inspect the contents of exceptions (#12076) @vyasr
Fix ingest_raw_data performance issue in Nested JSON reader due to RVO (#12070) @karthikeyann
Remove overflow error during decimal binops (#12063) @galipremsagar
Change cudf::detail::tdigest to cudf::tdigest::detail (#12050) @davidwendt
Fix quantile gtests coded in namespace cudf::test (#12049) @davidwendt
Add support for DataFrame.from_dict, to_dict and Series.to_dict (#12048) @galipremsagar
Refactor Parquet reader (#12046) @ttnghia
Forward merge 22.10 into 22.12 (#12045) @vyasr
Standardize newlines at ends of files. (#12042) @bdice
Trim trailing whitespace from all files. (#12041) @bdice
Use nosync policy in gather and scatter implementations. (#12038) @bdice
Remove smart quotes from all docstrings. (#12035) @bdice
Update cuda-python dependency to 11.7.1 (#12030) @galipremsagar
Add cython-lint to pre-commit checks. (#12020) @bdice
Use pragma once (#12019) @bdice
New GHA to add issues/prs to project board (#12016) @jarmak-nv
Add DataFrame.pivot_table. (#12015) @bdice
Rollback of DeviceBufferLike (#12009) @madsbk
Remove default parameters for nvtext::detail functions (#12007) @davidwendt
Remove default parameters for cudf::dictionary::detail functions (#12006) @davidwendt
Remove unused managed_allocator (#12005) @vyasr
Remove default parameters for cudf::strings::detail functions (#12003) @davidwendt
Remove unnecessary code from dask-cudf _Frame (#12001) @rjzamora
Ignore python docs build artifacts (#12000) @galipremsagar
Use rapids-cmake for google benchmark. (#11997) @vyasr
Leverage rapids_cython for more automated RPATH handling (#11996) @vyasr
Remove stale labeler (#11995) @raydouglass
Move protobuf compilation to CMake (#11986) @vyasr
Replace most of preprocessor usage in nvcomp adapter with constexpr (#11980) @vuule
Add missing noexcepts to column_in_metadata methods (#11973) @vyasr
Pass column names to write_csv instead of table_metadata pointer (#11972) @vuule
Accelerate libcudf segmented sort with CUB segmented sort (#11969) @davidwendt
Feature/remove default streams (#11967) @vyasr
Add pool memory resource to libcudf basic example (#11966) @davidwendt
Fix some libcudf calls to cudf::detail::gather (#11963) @davidwendt
Accept const refs instead of const unique_ptr refs in reduce and scan APIs. (#11960) @vyasr
Add deprecation warning for set_allocator. (#11958) @vyasr
Fix lists and structs gtests coded in namespace cudf::test (#11956) @davidwendt
Add full page indexes to Parquet writer benchmarks (#11955) @etseidl
Use gather-based strings factory in cudf::strings::strip (#11954) @davidwendt
Default to equal NaNs in make_merge_sets_aggregation. (#11952) @bdice
Add strip_delimiters option to read_text (#11946) @upsj
Refactor multibyte_split `output_builder` (#11945) @upsj
Remove validation that requires introspection (#11938) @vyasr
Add .str.find_multiple API (#11928) @galipremsagar
Add regex_program class for use with all regex APIs (#11927) @davidwendt
Enable backend dispatching for Dask-DataFrame creation (#11920) @rjzamora
Performance improvement in JSON Tree traversal (#11919) @karthikeyann
Fix some gtests incorrectly coded in namespace cudf::test (part I) (#11917) @davidwendt
Refactor pad/zfill functions for reuse with strings udf (#11914) @davidwendt
Add nanosecond & microsecond to DatetimeProperties (#11911) @galipremsagar
Pin mimesis version in setup.py. (#11906) @bdice
Error on ListColumn or any new unsupported column in cudf.Index (#11902) @galipremsagar
Add thrust output iterator fix (1805) to thrust.patch (#11900) @davidwendt
Relax codecov threshold diff (#11899) @galipremsagar
Use public APIs in STREAM_COMPACTION_NVBENCH (#11892) @GregoryKimball
Add coverage for string UDF tests. (#11891) @vyasr
Provide data_chunk_source wrapper for datasource (#11886) @upsj
Handle multibyte_split byte_range out-of-bounds offsets on host (#11885) @upsj
Add tests ensuring that cudf's default stream is always used (#11875) @vyasr
Change expect_strings_empty into expect_column_empty libcudf test utility (#11873) @davidwendt
Add ngroup (#11871) @shwina
Reduce memory usage in nested JSON parser - tree generation (#11864) @karthikeyann
Unpin dask and distributed for development (#11859) @galipremsagar
Remove unused includes for table/row_operators (#11857) @GregoryKimball
Use conda-forge's pyorc (#11855) @jakirkham
Add libcudf strings examples (#11849) @davidwendt
Remove cudf_io namespace alias (#11827) @vuule
Test/remove thrust vector usage (#11813) @vyasr
Add BGZIP reader to python read_text (#11802) @upsj
Merge branch-22.10 into branch-22.12 (#11801) @davidwendt
Fix compile warning from CUDF_FUNC_RANGE in a member function (#11798) @davidwendt
Update cudf JNI version to 22.12.0-SNAPSHOT (#11764) @pxLi
Update flake8 to 5.0.4 and use flake8-force to check Cython. (#11736) @bdice
Add BGZIP multibyte_split benchmark (#11723) @upsj
Bifurcate Dependency Lists (#11674) @bdice
Default to equal NaNs in make_collect_set_aggregation. (#11621) @bdice
Conform "bench_isin" to match generator column names (#11549) @GregoryKimball
Removing int8 column option from parquet byte_array writing (#11539) @hyperbolic2346
Add checks for HLG layers in dask-cudf groupby tests (#10853) @charlesbluca
part1: Simplify BaseIndex to an abstract class (#10389) @skirui-source
Make all nvcc warnings into errors (#8916) @trxcllnt

- C++
Published by GPUtester over 3 years ago

https://github.com/rapidsai/cudf - v22.12.00

🚨 Breaking Changes

Add JNI for substring without 'end' parameter. (#12113) @firestarman
Refactor purge_nonempty_nulls (#12111) @ttnghia
Create an int8 column in read_csv when all elements are missing (#12110) @vuule
Throw an error when libcudf is built without cuFile and LIBCUDF_CUFILE_POLICY is set to "ALWAYS" (#12080) @vuule
Fix type promotion edge cases in numerical binops (#12074) @wence-
Reduce/Remove reliance on **kwargs and *args in IO readers & writers (#12025) @galipremsagar
Rollback of DeviceBufferLike (#12009) @madsbk
Remove unused managed_allocator (#12005) @vyasr
Pass column names to write_csv instead of table_metadata pointer (#11972) @vuule
Accept const refs instead of const unique_ptr refs in reduce and scan APIs. (#11960) @vyasr
Default to equal NaNs in make_merge_sets_aggregation. (#11952) @bdice
Remove validation that requires introspection (#11938) @vyasr
Trim quotes for non-string values in nested json parsing (#11898) @karthikeyann
Add tests ensuring that cudf's default stream is always used (#11875) @vyasr
Support nested types as groupby keys in libcudf (#11792) @PointKernel
Default to equal NaNs in make_collect_set_aggregation. (#11621) @bdice
Removing int8 column option from parquet byte_array writing (#11539) @hyperbolic2346
part1: Simplify BaseIndex to an abstract class (#10389) @skirui-source

🐛 Bug Fixes

Fix include line for IO Cython modules (#12250) @vyasr
Make dask pinning looser (#12231) @vyasr
Workaround for CUB segmented-sort bug with boolean keys (#12217) @davidwendt
Fix from_dict backend dispatch to match upstream dask (#12203) @galipremsagar
Merge branch-22.10 into branch-22.12 (#12198) @davidwendt
Fix compression in ORC writer (#12194) @vuule
Don't use CMake 3.25.0 as it has a show stopping FindCUDAToolkit bug (#12188) @robertmaynard
Fix data corruption when reading ORC files with empty stripes (#12160) @vuule
Fix decimal binary operations (#12142) @galipremsagar
Ensure dlpack include is provided to cudf interop lib (#12139) @robertmaynard
Safely allocate udf_string pointers in strings_udf (#12138) @brandon-b-miller
Fix/disable jitify lto (#12122) @robertmaynard
Fix conditional_full_join benchmark (#12121) @GregoryKimball
Fix regex working-memory-size refactor error (#12119) @davidwendt
Add in negative size checks for columns (#12118) @revans2
Add JNI for substring without 'end' parameter. (#12113) @firestarman
Fix reading of CSV files with blank second row (#12098) @vuule
Fix an error in IO with GzipFile type (#12085) @galipremsagar
Workaround groupby aggregate thrust::copy_if overflow (#12079) @davidwendt
Fix alignment of compressed blocks in ORC writer (#12077) @vuule
Fix singleton-range __setitem__ edge case (#12075) @wence-
Fix type promotion edge cases in numerical binops (#12074) @wence-
Force using old fmt in nvbench. (#12067) @vyasr
Fixes List offset bug in Nested JSON reader (#12060) @karthikeyann
Allow falling back to shim_60.ptx by default in strings_udf (#12056) @brandon-b-miller
Force black exclusions for pre-commit. (#12036) @bdice
Add memory_usage & items implementation for Struct column & dtype (#12033) @galipremsagar
Reduce/Remove reliance on **kwargs and *args in IO readers & writers (#12025) @galipremsagar
Fixes bug in csv_reader_options construction in cython (#12021) @karthikeyann
Fix issues when both usecols and names options are used in read_csv (#12018) @vuule
Port thrust's pinned_allocator to cudf, since Thrust 1.17 removes the type (#12004) @robertmaynard
Revert "Replace most of preprocessor usage in nvcomp adapter with constexpr" (#11999) @vuule
Fix bug where df.loc resulting in single row could give wrong index (#11998) @eriknw
Switch to DISABLE_DEPRECATION_WARNINGS to match other RAPIDS projects (#11989) @robertmaynard
Fix maximum page size estimate in Parquet writer (#11962) @vuule
Fix local offset handling in bgzip reader (#11918) @upsj
Fix an issue reading struct-of-list types in Parquet. (#11910) @nvdbaranec
Fix memcheck error in TypeInference.Timestamp gtest (#11905) @davidwendt
Fix type casting in Series.setitem (#11904) @wence-
Fix memcheck error in get_dremel_data (#11903) @davidwendt
Fixes Unsupported column type error due to empty list columns in Nested JSON reader (#11897) @karthikeyann
Fix segmented-sort to ignore indices outside the offsets (#11888) @davidwendt
Fix cudf::stable_sorted_order for NaN and -NaN in FLOAT64 columns (#11874) @davidwendt
Fix writing of Parquet files with many fragments (#11869) @etseidl
Fix RangeIndex unary operators. (#11868) @vyasr
JNI Avoid NPE for reading host binary data (#11865) @revans2
Fix decimal benchmark input data generation (#11863) @karthikeyann
Fix pre-commit copyright check (#11860) @galipremsagar
Fix Parquet support for seconds and milliseconds duration types (#11854) @vuule
Ensure better compiler cache results between cudf cal-ver branches (#11835) @robertmaynard
Fix make_column_from_scalar for all-null strings column (#11807) @davidwendt
Tell jitify_preprocess where to search for libnvrtc (#11787) @robertmaynard
add V2 page header support to parquet reader (#11778) @etseidl
Parquet reader: bug fix for a num_rows/skip_rows corner case, w/optimization for nested preprocessing (#11752) @nvdbaranec
Determine if Arrow has S3 support at runtime in unit test. (#11560) @bdice

📖 Documentation

Use rapidsai CODE_OF_CONDUCT.md (#12166) @bdice
Add symlinks to notebooks. (#12128) @bdice
Add truncate API to python doc pages (#12109) @galipremsagar
Update Numba docs links. (#12107) @bdice
Remove "Multi-GPU with Dask-cuDF" notebook. (#12095) @bdice
Fix link to c++ developer guide from CONTRIBUTING.md (#12084) @brandon-b-miller
Add pivot_table and crosstab to docs. (#12014) @bdice
Fix doxygen text for cudf::dictionary::encode (#11991) @davidwendt
Replace default_stream_value with get_default_stream in docs. (#11985) @vyasr
Add dtype docs pages and docstrings for cudf specific dtypes (#11974) @galipremsagar
Update Unit Testing in libcudf guidelines to code tests outside the cudf::test namespace (#11959) @davidwendt
Rename libcudf++ to libcudf. (#11953) @bdice
Fix documentation referring to removed as_gpu_matrix method. (#11937) @bdice
Remove "experimental" warning for struct columns in ORC reader and writer (#11880) @vuule
Initial draft of policies and guidelines for libcudf usage. (#11853) @vyasr
Add clear indication of non-GPU accelerated parameters in read_json docstring (#11825) @GregoryKimball
Add developer docs for writing tests (#11199) @vyasr

🚀 New Features

Adds an EventHandler to Java MemoryBuffer to be invoked on close (#12125) @abellina
Support + in strings_udf (#12117) @brandon-b-miller
Support upper and lower in strings_udf (#12099) @brandon-b-miller
Add wheel builds (#12096) @vyasr
Allow setting malloc heap size in string udfs (#12094) @brandon-b-miller
Support strip, lstrip, and rstrip in strings_udf (#12091) @brandon-b-miller
Mark nvcomp zstd compression stable (#12059) @jbrennan333
Add debug-only onAllocated/onDeallocated to RmmEventHandler (#12054) @abellina
Enable building against the libarrow contained in pyarrow (#12034) @vyasr
Add strings like jni and native method (#12032) @cindyyuanjiang
Cleanup common parsing code in JSON, CSV reader (#12022) @karthikeyann
byte_range support for JSON Lines format (#12017) @karthikeyann
Minor cleanup of root CMakeLists.txt for better organization (#11988) @robertmaynard
Add inplace arithmetic operators to MaskedType (#11987) @brandon-b-miller
Implement JNI for chunked Parquet reader (#11961) @ttnghia
Add method argument to DataFrame.quantile (#11957) @rjzamora
Add gpu memory watermark apis to JNI (#11950) @abellina
Adds retryCount to RmmEventHandler.onAllocFailure (#11940) @abellina
Enable returning string data from UDFs used through apply (#11933) @brandon-b-miller
Switch over to rapids-cmake patches for thrust (#11921) @robertmaynard
Add strings udf C++ classes and functions for phase II (#11912) @davidwendt
Trim quotes for non-string values in nested json parsing (#11898) @karthikeyann
Enable CEC for strings_udf (#11884) @brandon-b-miller
ArrowIPCTableWriter writes an empty batch in the case of an empty table. (#11883) @firestarman
Implement chunked Parquet reader (#11867) @ttnghia
Add read_orc_metadata to libcudf (#11815) @vuule
Support nested types as groupby keys in libcudf (#11792) @PointKernel
Adding feature Truncate to DataFrame and Series (#11435) @VamsiTallam95

🛠️ Improvements

Reduce number of tests marked spilling (#12197) @madsbk
Pin dask and distributed for release (#12165) @galipremsagar
Don't rely on GNU find in headers_test.sh (#12164) @wence-
Update cp.clip call (#12148) @quasiben
Enable automatic column projection in groupby().agg (#12124) @rjzamora
Refactor purge_nonempty_nulls (#12111) @ttnghia
Create an int8 column in read_csv when all elements are missing (#12110) @vuule
Spilling to host memory (#12106) @madsbk
First pass of pd.read_orc changes in tests (#12103) @galipremsagar
Expose engine argument in dask_cudf.read_json (#12101) @rjzamora
Remove CUDA 10 compatibility code. (#12088) @bdice
Move and update dask nightly install in CI (#12082) @galipremsagar
Throw an error when libcudf is built without cuFile and LIBCUDF_CUFILE_POLICY is set to "ALWAYS" (#12080) @vuule
Remove macros that inspect the contents of exceptions (#12076) @vyasr
Fix ingest_raw_data performance issue in Nested JSON reader due to RVO (#12070) @karthikeyann
Remove overflow error during decimal binops (#12063) @galipremsagar
Change cudf::detail::tdigest to cudf::tdigest::detail (#12050) @davidwendt
Fix quantile gtests coded in namespace cudf::test (#12049) @davidwendt
Add support for DataFrame.from_dict/to_dict and Series.to_dict (#12048) @galipremsagar
Refactor Parquet reader (#12046) @ttnghia
Forward merge 22.10 into 22.12 (#12045) @vyasr
Standardize newlines at ends of files. (#12042) @bdice
Trim trailing whitespace from all files. (#12041) @bdice
Use nosync policy in gather and scatter implementations. (#12038) @bdice
Remove smart quotes from all docstrings. (#12035) @bdice
Update cuda-python dependency to 11.7.1 (#12030) @galipremsagar
Add cython-lint to pre-commit checks. (#12020) @bdice
Use pragma once (#12019) @bdice
New GHA to add issues/prs to project board (#12016) @jarmak-nv
Add DataFrame.pivot_table. (#12015) @bdice
Rollback of DeviceBufferLike (#12009) @madsbk
Remove default parameters for nvtext::detail functions (#12007) @davidwendt
Remove default parameters for cudf::dictionary::detail functions (#12006) @davidwendt
Remove unused managed_allocator (#12005) @vyasr
Remove default parameters for cudf::strings::detail functions (#12003) @davidwendt
Remove unnecessary code from dask-cudf _Frame (#12001) @rjzamora
Ignore python docs build artifacts (#12000) @galipremsagar
Use rapids-cmake for google benchmark. (#11997) @vyasr
Leverage rapids_cython for more automated RPATH handling (#11996) @vyasr
Remove stale labeler (#11995) @raydouglass
Move protobuf compilation to CMake (#11986) @vyasr
Replace most of preprocessor usage in nvcomp adapter with constexpr (#11980) @vuule
Add missing noexcepts to column_in_metadata methods (#11973) @vyasr
Pass column names to write_csv instead of table_metadata pointer (#11972) @vuule
Accelerate libcudf segmented sort with CUB segmented sort (#11969) @davidwendt
Feature/remove default streams (#11967) @vyasr
Add pool memory resource to libcudf basic example (#11966) @davidwendt
Fix some libcudf calls to cudf::detail::gather (#11963) @davidwendt
Accept const refs instead of const unique_ptr refs in reduce and scan APIs. (#11960) @vyasr
Add deprecation warning for set_allocator. (#11958) @vyasr
Fix lists and structs gtests coded in namespace cudf::test (#11956) @davidwendt
Add full page indexes to Parquet writer benchmarks (#11955) @etseidl
Use gather-based strings factory in cudf::strings::strip (#11954) @davidwendt
Default to equal NaNs in make_merge_sets_aggregation. (#11952) @bdice
Add strip_delimiters option to read_text (#11946) @upsj
Refactor multibyte_split `output_builder` (#11945) @upsj
Remove validation that requires introspection (#11938) @vyasr
Add .str.find_multiple API (#11928) @galipremsagar
Add regex_program class for use with all regex APIs (#11927) @davidwendt
Enable backend dispatching for Dask-DataFrame creation (#11920) @rjzamora
Performance improvement in JSON Tree traversal (#11919) @karthikeyann
Fix some gtests incorrectly coded in namespace cudf::test (part I) (#11917) @davidwendt
Refactor pad/zfill functions for reuse with strings udf (#11914) @davidwendt
Add nanosecond & microsecond to DatetimeProperties (#11911) @galipremsagar
Pin mimesis version in setup.py. (#11906) @bdice
Error on ListColumn or any new unsupported column in cudf.Index (#11902) @galipremsagar
Add thrust output iterator fix (1805) to thrust.patch (#11900) @davidwendt
Relax codecov threshold diff (#11899) @galipremsagar
Use public APIs in STREAM_COMPACTION_NVBENCH (#11892) @GregoryKimball
Add coverage for string UDF tests. (#11891) @vyasr
Provide data_chunk_source wrapper for datasource (#11886) @upsj
Handle multibyte_split byte_range out-of-bounds offsets on host (#11885) @upsj
Add tests ensuring that cudf's default stream is always used (#11875) @vyasr
Change expect_strings_empty into expect_column_empty libcudf test utility (#11873) @davidwendt
Add ngroup (#11871) @shwina
Reduce memory usage in nested JSON parser - tree generation (#11864) @karthikeyann
Unpin dask and distributed for development (#11859) @galipremsagar
Remove unused includes for table/row_operators (#11857) @GregoryKimball
Use conda-forge's pyorc (#11855) @jakirkham
Add libcudf strings examples (#11849) @davidwendt
Remove cudf_io namespace alias (#11827) @vuule
Test/remove thrust vector usage (#11813) @vyasr
Add BGZIP reader to python read_text (#11802) @upsj
Merge branch-22.10 into branch-22.12 (#11801) @davidwendt
Fix compile warning from CUDF_FUNC_RANGE in a member function (#11798) @davidwendt
Update cudf JNI version to 22.12.0-SNAPSHOT (#11764) @pxLi
Update flake8 to 5.0.4 and use flake8-force to check Cython. (#11736) @bdice
Add BGZIP multibyte_split benchmark (#11723) @upsj
Bifurcate Dependency Lists (#11674) @bdice
Default to equal NaNs in make_collect_set_aggregation. (#11621) @bdice
Conform "bench_isin" to match generator column names (#11549) @GregoryKimball
Removing int8 column option from parquet byte_array writing (#11539) @hyperbolic2346
Add checks for HLG layers in dask-cudf groupby tests (#10853) @charlesbluca
part1: Simplify BaseIndex to an abstract class (#10389) @skirui-source
Make all nvcc warnings into errors (#8916) @trxcllnt

- C++
Published by GPUtester over 3 years ago

https://github.com/rapidsai/cudf - [NIGHTLY] v22.10.00

🚨 Breaking Changes

Disable Zstandard decompression on nvCOMP 2.4 and Pascal GPUs (#11856) @vuule
Disable nvCOMP DEFLATE integration (#11811) @vuule
Fix return type of Index.isna & Index.notna (#11769) @galipremsagar
Remove kwargs in read_csv & to_csv (#11762) @galipremsagar
Fix cudf::partition* APIs that do not return offsets for empty output table (#11709) @ttnghia
Fix regex negated classes to not automatically include new-lines (#11644) @davidwendt
Update zfill to match Python output (#11634) @davidwendt
Upgrade pandas to 1.5 (#11617) @galipremsagar
Change default value of ordered to False in CategoricalDtype (#11604) @galipremsagar
Move cudf::strings::findall_record to cudf::strings::findall (#11575) @davidwendt
Adding optional parquet reader schema (#11524) @hyperbolic2346
Deprecate skiprows and num_rows in read_orc (#11522) @galipremsagar
Remove support for skip_rows / num_rows options in the parquet reader. (#11503) @nvdbaranec
Drop support for skiprows and num_rows in cudf.read_parquet (#11480) @galipremsagar
Disable Arrow S3 support by default. (#11470) @bdice
Convert thrust::optional usages to std::optional (#11455) @robertmaynard
Remove unused is_struct trait. (#11450) @bdice
Refactor the Buffer class (#11447) @madsbk
Return empty dataframe when reading an ORC file using empty columns option (#11446) @vuule
Refactor pad_side and strip_type enums into side_type enum (#11438) @davidwendt
Remove HASH_SERIAL_MURMUR3 / serial32BitMurmurHash3 (#11383) @bdice
Use the new JSON parser when the experimental reader is selected (#11364) @vuule
Remove deprecated Series.applymap. (#11031) @bdice
Remove deprecated expand parameter from str.findall. (#11030) @bdice

🐛 Bug Fixes

Force using old fmt in nvbench. (#12064) @vyasr
Update cuda-python dependency to 11.7.1 (#11994) @shwina
Fixes bug in temporary decompression space estimation before calling nvcomp (#11879) @abellina
Handle ptx file paths during strings_udf import (#11862) @galipremsagar
Disable Zstandard decompression on nvCOMP 2.4 and Pascal GPUs (#11856) @vuule
Reset strings_udf CEC and solve several related issues (#11846) @brandon-b-miller
Fix bug in new shuffle-based groupby implementation (#11836) @rjzamora
Fix is_valid checks in Scalar._binaryop (#11818) @wence-
Fix operator NotImplemented issue with numpy (#11816) @galipremsagar
Disable nvCOMP DEFLATE integration (#11811) @vuule
Build strings_udf package with other python packages in nightlies (#11808) @brandon-b-miller
Revert problematic shuffle=explicit-comms changes (#11803) @rjzamora
Fix regex out-of-bounds write in strided rows logic (#11797) @davidwendt
Build cudf locally before building strings_udf conda packages in CI (#11785) @brandon-b-miller
Fix an issue in cudf::row_bit_count involving structs and lists at multiple levels. (#11779) @nvdbaranec
Fix return type of Index.isna & Index.notna (#11769) @galipremsagar
Fix issue with set-item in case of list and struct types (#11760) @galipremsagar
Ensure all libcudf APIs run on cudf's default stream (#11759) @vyasr
Resolve dask_cudf failures caused by upstream groupby changes (#11755) @rjzamora
Fix ORC string sum statistics (#11740) @vuule
Add strings_udf package for python 3.9 (#11730) @brandon-b-miller
Ensure that all tests launch kernels on cudf's default stream (#11726) @vyasr
Don't assume stream is a compile-time constant expression (#11725) @vyasr
Fix get_thrust.cmake format at patch command (#11715) @davidwendt
Fix cudf::partition* APIs that do not return offsets for empty output table (#11709) @ttnghia
Fix cudf::lists::sort_lists for NaN and Infinity values (#11703) @davidwendt
Modify ORC reader timestamp parsing to match the apache reader behavior (#11699) @vuule
Fix DataFrame.from_arrow to preserve type metadata (#11698) @galipremsagar
Fix compile error due to missing header (#11697) @ttnghia
Default to Snappy compression in to_orc when using cuDF or Dask (#11690) @vuule
Fix an issue related to MultiIndex when group_keys=True (#11689) @galipremsagar
Transfer correct dtype to exploded column (#11687) @wence-
Ignore protobuf generated files in mypy checks (#11685) @galipremsagar
Maintain the index name after .loc (#11677) @shwina
Fix issue with extracting nested column data & dtype preservation (#11671) @galipremsagar
Ensure that all cudf tests and benchmarks are conda env aware (#11666) @robertmaynard
Update to Thrust 1.17.2 to fix cub ODR issues (#11665) @robertmaynard
Fix multi-file remote datasource bug (#11655) @rjzamora
Fix invalid regex quantifier check to not include alternation (#11654) @davidwendt
Fix bug in device_write(): it uses an incorrect size (#11651) @madsbk
fixes overflows in benchmarks (#11649) @elstehle
Fix regex negated classes to not automatically include new-lines (#11644) @davidwendt
Fix compile error in benchmark nested_json.cpp (#11637) @davidwendt
Update zfill to match Python output (#11634) @davidwendt
Removed converted type for INT32 and INT64 since they do not convert (#11627) @hyperbolic2346
Fix host scalars construction of nested types (#11612) @galipremsagar
Fix compile warning in nested_json_gpu.cu (#11607) @davidwendt
Change default value of ordered to False in CategoricalDtype (#11604) @galipremsagar
Preserve order if necessary when deduping categoricals internally (#11597) @brandon-b-miller
Add is_timestamp test for leap second (60) (#11594) @davidwendt
Fix an issue with to_arrow when column name type is not a string (#11590) @galipremsagar
Fix exception in segmented-reduce benchmark (#11588) @davidwendt
Fix encode/decode of negative timestamps in ORC reader/writer (#11586) @vuule
Correct distribution data type in quantiles benchmark (#11584) @vuule
Fix multibyte_split benchmark for host buffers (#11583) @upsj
xfail custreamz display test for now (#11567) @shwina
Fix JNI for TableWithMeta to use schema_info instead of column_names (#11566) @jlowe
Reduce code duplication for dask & distributed nightly/stable installs (#11565) @galipremsagar
Fix groupby failures in dask_cudf CI (#11561) @rjzamora
Fix for pivot: error when 'values' is a multicharacter string (#11538) @shaswat-indian
find_package(cudf) + arrow9 usable with cudf build directory (#11535) @robertmaynard
Fixing crash when writing binary nested data in parquet (#11526) @hyperbolic2346
Fix for: error when assigning a value to an empty series (#11523) @shaswat-indian
Fix invalid results from conditional-left-anti-join in debug build (#11517) @davidwendt
Fix cmake error after upgrading to Arrow 9 (#11513) @ttnghia
Fix reverse binary operators acting on a host value and cudf.Scalar (#11512) @bdice
Update parquet fuzz tests to drop support for skiprows & num_rows (#11505) @galipremsagar
Use rapids-cmake 22.10 best practice for RAPIDS.cmake location (#11493) @robertmaynard
Handle some zero-sized corner cases in dlpack interop (#11449) @wence-
Return empty dataframe when reading an ORC file using empty columns option (#11446) @vuule
libcudf c++ example updated to CPM version 0.35.3 (#11417) @robertmaynard
Fix regex quantifier check to include capture groups (#11373) @davidwendt
Fix read_text when byte_range is aligned with field (#11371) @upsj
Fix to_timestamps truncated subsecond calculation (#11367) @davidwendt
column: calculate null_count before release()ing the cudf::column (#11365) @wence-

📖 Documentation

Update guide-to-udfs notebook (#11861) @brandon-b-miller
Update docstring for cudf.read_text (#11799) @GregoryKimball
Add doc section for list & struct handling (#11770) @galipremsagar
Document that minimum required CMake version is now 3.23.1 (#11751) @robertmaynard
Update libcudf documentation build command in DOCUMENTATION.md (#11735) @davidwendt
Add docs for use of string data to DataFrame.apply and Series.apply and update guide to UDFs notebook (#11733) @brandon-b-miller
Enable more Pydocstyle rules (#11582) @bdice
Remove unused cpp/img folder (#11554) @davidwendt
Publish C++ developer docs (#11475) @vyasr
Fix a misalignment in cudf.get_dummies docstring (#11443) @galipremsagar
Update contributing doc to include links to the developer guides (#11390) @davidwendt
Fix table_view_base doxygen format (#11340) @davidwendt
Create main developer guide for Python (#11235) @vyasr
Add developer documentation for benchmarking (#11122) @vyasr
cuDF error handling document (#7917) @isVoid

🚀 New Features

Add hasNull statistic reading ability to ORC (#11747) @devavret
Add istitle to string UDFs (#11738) @brandon-b-miller
JSON Column creation in GPU (#11714) @karthikeyann
Adds option to take explicit nested schema for nested JSON reader (#11682) @elstehle
Add BGZIP data_chunk_reader (#11652) @upsj
Support DECIMAL order-by for RANGE window functions (#11645) @mythrocks
changing version of cmake to 3.23.3 (#11619) @hyperbolic2346
Generate unique keys table in java JNI contiguousSplitGroups (#11614) @res-life
Generic type casting to support the new nested JSON reader (#11613) @elstehle
JSON tree traversal (#11610) @karthikeyann
Add casting operators to masked UDFs (#11578) @brandon-b-miller
Adds type inference and type conversion for leaf-columns to the nested JSON parser (#11574) @elstehle
Add strings 'like' function (#11558) @davidwendt
Handle hyphen as literal for regex cclass when incomplete range (#11557) @davidwendt
Enable ZSTD compression in ORC and Parquet writers (#11551) @vuule
Adds support for json lines format to the nested JSON reader (#11534) @elstehle
Adding optional parquet reader schema (#11524) @hyperbolic2346
Adds GPU implementation of JSON-token-stream to JSON-tree (#11518) @karthikeyann
Add gdb pretty-printers for simple types (#11499) @upsj
Add create_random_column function to the data generator (#11490) @vuule
Add fluent API builder to data_profile (#11479) @vuule
Adds Nested Json benchmark (#11466) @karthikeyann
Convert thrust::optional usages to std::optional (#11455) @robertmaynard
Python API for the future experimental JSON reader (#11426) @vuule
Return schema info from JSON reader (#11419) @vuule
Add regex ASCII flag support for matching builtin character classes (#11404) @davidwendt
Truncate parquet column indexes (#11403) @etseidl
Adds the end-to-end JSON parser implementation (#11388) @elstehle
Use the new JSON parser when the experimental reader is selected (#11364) @vuule
Add placeholder for the experimental JSON reader (#11334) @vuule
Add read-only functions on string dtypes to DataFrame.apply and Series.apply (#11319) @brandon-b-miller
Added 'crosstab' and 'pivot_table' features (#11314) @shaswat-indian
Quickly error out when trying to build with unsupported nvcc versions (#11297) @robertmaynard
Adds JSON tokenizer (#11264) @elstehle
List lexicographic comparator (#11129) @devavret
Add generic type inference for cuIO (#11121) @PointKernel
Fully support nested types in cudf::contains (#10656) @ttnghia
Support nested types in lists::contains (#10548) @ttnghia

🛠️ Improvements

Pin dask and distributed for release (#11822) @galipremsagar
Add examples for Nested JSON reader (#11814) @GregoryKimball
Support shuffle-based groupby aggregations in dask_cudf (#11800) @rjzamora
Update strings udf version updater script (#11772) @galipremsagar
Remove kwargs in read_csv & to_csv (#11762) @galipremsagar
Pass dtype param to avoid pd.Series warnings (#11761) @galipremsagar
Enable schema_element & keep_quotes support in json reader (#11746) @galipremsagar
Add ability to construct ListColumn when size is None (#11745) @galipremsagar
Reduces memory requirements in JSON parser and adds bytes/s and peak memory usage to benchmarks (#11732) @elstehle
Add missing copyright headers. (#11712) @bdice
Fix copyright check issues in pre-commit (#11711) @bdice
Include decimal in supported types for range window order-by columns (#11710) @mythrocks
Disable very large column gtest for contiguous-split (#11706) @davidwendt
Drop split_out=None test from groupby.agg (#11704) @wence-
Use CubinLinker for CUDA Minor Version Compatibility (#11701) @gmarkall
Add regex capture-group parameter to auto convert to non-capture groups (#11695) @davidwendt
Add a __dataframe__ method to the protocol dataframe object (#11692) @rgommers
Special-case multibyte_split for single-byte delimiter (#11681) @upsj
Remove isort exclusions (#11680) @bdice
Refactor CSV reader benchmarks with nvbench (#11678) @PointKernel
Check conda recipe headers with pre-commit (#11669) @bdice
Remove redundant style check for clang-format. (#11668) @bdice
Add support for group_keys in groupby (#11659) @galipremsagar
Fix pandoc pinning. (#11658) @bdice
Revert removal of skip_rows / num_rows options from the Parquet reader. (#11657) @nvdbaranec
Update git metadata (#11647) @bdice
Call set_null_count on a returning column if null-count is known (#11646) @davidwendt
Fix some libcudf detail calls not passing the stream variable (#11642) @davidwendt
Update to mypy 0.971 (#11640) @wence-
Refactor strings strip functor to details header (#11635) @davidwendt
Fix incorrect nullCount in get_json_object (#11633) @trxcllnt
Simplify hostdevice_vector (#11631) @upsj
Refactor parquet writer benchmarks with nvbench (#11623) @PointKernel
Rework contains_scalar to check nulls at runtime (#11622) @davidwendt
Fix incorrect memory resource used in rolling temp columns (#11618) @mythrocks
Upgrade pandas to 1.5 (#11617) @galipremsagar
Move type-dispatcher calls from traits.hpp to traits.cpp (#11616) @davidwendt
Refactor parquet reader benchmarks with nvbench (#11611) @PointKernel
Forward-merge branch-22.08 to branch-22.10 (#11608) @bdice
Use stream in Java API. (#11601) @bdice
Refactors of public/detail APIs, CUDF_FUNC_RANGE, stream handling. (#11600) @bdice
Improve ORC writer benchmark with nvbench (#11598) @PointKernel
Tune multibyte_split kernel (#11587) @upsj
Move split_utils.cuh to strings/detail (#11585) @davidwendt
Fix warnings due to compiler regression with if constexpr (#11581) @ttnghia
Add full 24-bit dictionary support to Parquet writer (#11580) @etseidl
Expose "explicit-comms" option in shuffle-based dask_cudf functions (#11576) @rjzamora
Move cudf::strings::findall_record to cudf::strings::findall (#11575) @davidwendt
Refactor dask_cudf groupby to use apply_concat_apply (#11571) @rjzamora
Add ability to write list(struct) columns as map type in orc writer (#11568) @galipremsagar
Add byte_range to multibyte_split benchmark + NVBench refactor (#11562) @upsj
JNI support for writing binary columns in parquet (#11556) @revans2
Support additional dictionary bit widths in Parquet writer (#11547) @etseidl
Refactor string/numeric conversion utilities (#11545) @davidwendt
Removing unnecessary asserts in parquet tests (#11544) @hyperbolic2346
Clean up ORC reader benchmarks with NVBench (#11543) @PointKernel
Reuse MurmurHash3_32 in Parquet page data. (#11528) @bdice
Add hexadecimal value separators (#11527) @bdice
Deprecate skiprows and num_rows in read_orc (#11522) @galipremsagar
Struct support for NULL_EQUALS binary operation (#11520) @rwlee
Bump hadoop-common from 3.2.3 to 3.2.4 in /java (#11516) @dependabot[bot]
Fix Feather test warning. (#11511) @bdice
copy_range ballot_syncs to have no execution dependency (#11508) @robertmaynard
Upgrade to arrow-9.x (#11507) @galipremsagar
Remove support for skip_rows / num_rows options in the parquet reader. (#11503) @nvdbaranec
Single-pass multibyte_split (#11500) @upsj
Sanitize percentile_approx() output for empty input (#11498) @SrikarVanavasam
Unpin dask and distributed for development (#11492) @galipremsagar
Move SparkMurmurHash3_32 functor. (#11489) @bdice
Refactor group_nunique.cu to use nullate::DYNAMIC for reduce-by-key functor (#11482) @davidwendt
Drop support for skiprows and num_rows in cudf.read_parquet (#11480) @galipremsagar
Add reduction distinct_count benchmark (#11473) @ttnghia
Add groupby nunique aggregation benchmark (#11472) @ttnghia
Disable Arrow S3 support by default. (#11470) @bdice
Add groupby max aggregation benchmark (#11464) @ttnghia
Extract Dremel encoding code from Parquet (#11461) @vyasr
Add missing Thrust #includes. (#11457) @bdice
Make CMake hooks verbose (#11456) @vyasr
Control Parquet page size through Python API (#11454) @etseidl
Add control of Parquet column index creation to python (#11453) @etseidl
Remove unused is_struct trait. (#11450) @bdice
Refactor the Buffer class (#11447) @madsbk
Refactor pad_side and strip_type enums into side_type enum (#11438) @davidwendt
Update to Thrust 1.17.0 (#11437) @bdice
Add in JNI for parsing JSON data and getting the metadata back too. (#11431) @revans2
Convert byte_array_view to use std::byte (#11424) @hyperbolic2346
Deprecate unflatten_nested_columns (#11421) @SrikarVanavasam
Remove HASH_SERIAL_MURMUR3 / serial32BitMurmurHash3 (#11383) @bdice
Add Spark list hashing Java tests (#11379) @bdice
Move cmake to the build section. (#11376) @vyasr
Remove use of CUDA driver API calls from libcudf (#11370) @shwina
Add column constructor from device_uvector&& (#11356) @SrikarVanavasam
Remove unused custreamz thirdparty directory (#11343) @vyasr
Update jni version to 22.10.0-SNAPSHOT (#11338) @pxLi
Enable using upstream jitify2 (#11287) @shwina
Cache cudf.Scalar (#11246) @shwina
Remove deprecated Series.applymap. (#11031) @bdice
Remove deprecated expand parameter from str.findall. (#11030) @bdice

Published by rapids-bot[bot] over 3 years ago

https://github.com/rapidsai/cudf - v22.10.01

🚨 Breaking Changes

Disable Zstandard decompression on nvCOMP 2.4 and Pascal GPUs (#11856) @vuule
Disable nvCOMP DEFLATE integration (#11811) @vuule
Fix return type of Index.isna & Index.notna (#11769) @galipremsagar
Remove kwargs in read_csv & to_csv (#11762) @galipremsagar
Fix cudf::partition* APIs that do not return offsets for empty output table (#11709) @ttnghia
Fix regex negated classes to not automatically include new-lines (#11644) @davidwendt
Update zfill to match Python output (#11634) @davidwendt
Upgrade pandas to 1.5 (#11617) @galipremsagar
Change default value of ordered to False in CategoricalDtype (#11604) @galipremsagar
Move cudf::strings::findall_record to cudf::strings::findall (#11575) @davidwendt
Adding optional parquet reader schema (#11524) @hyperbolic2346
Deprecate skiprows and num_rows in read_orc (#11522) @galipremsagar
Remove support for skip_rows / num_rows options in the parquet reader. (#11503) @nvdbaranec
Drop support for skiprows and num_rows in cudf.read_parquet (#11480) @galipremsagar
Disable Arrow S3 support by default. (#11470) @bdice
Convert thrust::optional usages to std::optional (#11455) @robertmaynard
Remove unused is_struct trait. (#11450) @bdice
Refactor the Buffer class (#11447) @madsbk
Return empty dataframe when reading an ORC file using empty columns option (#11446) @vuule
Refactor pad_side and strip_type enums into side_type enum (#11438) @davidwendt
Remove HASH_SERIAL_MURMUR3 / serial32BitMurmurHash3 (#11383) @bdice
Use the new JSON parser when the experimental reader is selected (#11364) @vuule
Remove deprecated Series.applymap. (#11031) @bdice
Remove deprecated expand parameter from str.findall. (#11030) @bdice

🐛 Bug Fixes

Update cuda-python dependency to 11.7.1 (#11994) @shwina
Fixes bug in temporary decompression space estimation before calling nvcomp (#11879) @abellina
Handle ptx file paths during strings_udf import (#11862) @galipremsagar
Disable Zstandard decompression on nvCOMP 2.4 and Pascal GPUs (#11856) @vuule
Reset strings_udf CEC and solve several related issues (#11846) @brandon-b-miller
Fix bug in new shuffle-based groupby implementation (#11836) @rjzamora
Fix is_valid checks in Scalar._binaryop (#11818) @wence-
Fix operator NotImplemented issue with numpy (#11816) @galipremsagar
Disable nvCOMP DEFLATE integration (#11811) @vuule
Build strings_udf package with other python packages in nightlies (#11808) @brandon-b-miller
Revert problematic shuffle=explicit-comms changes (#11803) @rjzamora
Fix regex out-of-bounds write in strided rows logic (#11797) @davidwendt
Build cudf locally before building strings_udf conda packages in CI (#11785) @brandon-b-miller
Fix an issue in cudf::row_bit_count involving structs and lists at multiple levels. (#11779) @nvdbaranec
Fix return type of Index.isna & Index.notna (#11769) @galipremsagar
Fix issue with set-item in case of list and struct types (#11760) @galipremsagar
Ensure all libcudf APIs run on cudf's default stream (#11759) @vyasr
Resolve dask_cudf failures caused by upstream groupby changes (#11755) @rjzamora
Fix ORC string sum statistics (#11740) @vuule
Add strings_udf package for python 3.9 (#11730) @brandon-b-miller
Ensure that all tests launch kernels on cudf's default stream (#11726) @vyasr
Don't assume stream is a compile-time constant expression (#11725) @vyasr
Fix get_thrust.cmake format at patch command (#11715) @davidwendt
Fix cudf::partition* APIs that do not return offsets for empty output table (#11709) @ttnghia
Fix cudf::lists::sort_lists for NaN and Infinity values (#11703) @davidwendt
Modify ORC reader timestamp parsing to match the apache reader behavior (#11699) @vuule
Fix DataFrame.from_arrow to preserve type metadata (#11698) @galipremsagar
Fix compile error due to missing header (#11697) @ttnghia
Default to Snappy compression in to_orc when using cuDF or Dask (#11690) @vuule
Fix an issue related to MultiIndex when group_keys=True (#11689) @galipremsagar
Transfer correct dtype to exploded column (#11687) @wence-
Ignore protobuf generated files in mypy checks (#11685) @galipremsagar
Maintain the index name after .loc (#11677) @shwina
Fix issue with extracting nested column data & dtype preservation (#11671) @galipremsagar
Ensure that all cudf tests and benchmarks are conda env aware (#11666) @robertmaynard
Update to Thrust 1.17.2 to fix cub ODR issues (#11665) @robertmaynard
Fix multi-file remote datasource bug (#11655) @rjzamora
Fix invalid regex quantifier check to not include alternation (#11654) @davidwendt
Fix bug in device_write(): it uses an incorrect size (#11651) @madsbk
fixes overflows in benchmarks (#11649) @elstehle
Fix regex negated classes to not automatically include new-lines (#11644) @davidwendt
Fix compile error in benchmark nested_json.cpp (#11637) @davidwendt
Update zfill to match Python output (#11634) @davidwendt
Removed converted type for INT32 and INT64 since they do not convert (#11627) @hyperbolic2346
Fix host scalars construction of nested types (#11612) @galipremsagar
Fix compile warning in nested_json_gpu.cu (#11607) @davidwendt
Change default value of ordered to False in CategoricalDtype (#11604) @galipremsagar
Preserve order if necessary when deduping categoricals internally (#11597) @brandon-b-miller
Add is_timestamp test for leap second (60) (#11594) @davidwendt
Fix an issue with to_arrow when column name type is not a string (#11590) @galipremsagar
Fix exception in segmented-reduce benchmark (#11588) @davidwendt
Fix encode/decode of negative timestamps in ORC reader/writer (#11586) @vuule
Correct distribution data type in quantiles benchmark (#11584) @vuule
Fix multibyte_split benchmark for host buffers (#11583) @upsj
xfail custreamz display test for now (#11567) @shwina
Fix JNI for TableWithMeta to use schema_info instead of column_names (#11566) @jlowe
Reduce code duplication for dask & distributed nightly/stable installs (#11565) @galipremsagar
Fix groupby failures in dask_cudf CI (#11561) @rjzamora
Fix for pivot: error when 'values' is a multicharacter string (#11538) @shaswat-indian
find_package(cudf) + arrow9 usable with cudf build directory (#11535) @robertmaynard
Fixing crash when writing binary nested data in parquet (#11526) @hyperbolic2346
Fix for: error when assigning a value to an empty series (#11523) @shaswat-indian
Fix invalid results from conditional-left-anti-join in debug build (#11517) @davidwendt
Fix cmake error after upgrading to Arrow 9 (#11513) @ttnghia
Fix reverse binary operators acting on a host value and cudf.Scalar (#11512) @bdice
Update parquet fuzz tests to drop support for skiprows & num_rows (#11505) @galipremsagar
Use rapids-cmake 22.10 best practice for RAPIDS.cmake location (#11493) @robertmaynard
Handle some zero-sized corner cases in dlpack interop (#11449) @wence-
Return empty dataframe when reading an ORC file using empty columns option (#11446) @vuule
libcudf c++ example updated to CPM version 0.35.3 (#11417) @robertmaynard
Fix regex quantifier check to include capture groups (#11373) @davidwendt
Fix read_text when byte_range is aligned with field (#11371) @upsj
Fix to_timestamps truncated subsecond calculation (#11367) @davidwendt
column: calculate null_count before release()ing the cudf::column (#11365) @wence-

📖 Documentation

Update guide-to-udfs notebook (#11861) @brandon-b-miller
Update docstring for cudf.read_text (#11799) @GregoryKimball
Add doc section for list & struct handling (#11770) @galipremsagar
Document that minimum required CMake version is now 3.23.1 (#11751) @robertmaynard
Update libcudf documentation build command in DOCUMENTATION.md (#11735) @davidwendt
Add docs for use of string data to DataFrame.apply and Series.apply and update guide to UDFs notebook (#11733) @brandon-b-miller
Enable more Pydocstyle rules (#11582) @bdice
Remove unused cpp/img folder (#11554) @davidwendt
Publish C++ developer docs (#11475) @vyasr
Fix a misalignment in cudf.get_dummies docstring (#11443) @galipremsagar
Update contributing doc to include links to the developer guides (#11390) @davidwendt
Fix table_view_base doxygen format (#11340) @davidwendt
Create main developer guide for Python (#11235) @vyasr
Add developer documentation for benchmarking (#11122) @vyasr
cuDF error handling document (#7917) @isVoid

🚀 New Features

Add hasNull statistic reading ability to ORC (#11747) @devavret
Add istitle to string UDFs (#11738) @brandon-b-miller
JSON Column creation in GPU (#11714) @karthikeyann
Adds option to take explicit nested schema for nested JSON reader (#11682) @elstehle
Add BGZIP data_chunk_reader (#11652) @upsj
Support DECIMAL order-by for RANGE window functions (#11645) @mythrocks
changing version of cmake to 3.23.3 (#11619) @hyperbolic2346
Generate unique keys table in java JNI contiguousSplitGroups (#11614) @res-life
Generic type casting to support the new nested JSON reader (#11613) @elstehle
JSON tree traversal (#11610) @karthikeyann
Add casting operators to masked UDFs (#11578) @brandon-b-miller
Adds type inference and type conversion for leaf-columns to the nested JSON parser (#11574) @elstehle
Add strings 'like' function (#11558) @davidwendt
Handle hyphen as literal for regex cclass when incomplete range (#11557) @davidwendt
Enable ZSTD compression in ORC and Parquet writers (#11551) @vuule
Adds support for json lines format to the nested JSON reader (#11534) @elstehle
Adding optional parquet reader schema (#11524) @hyperbolic2346
Adds GPU implementation of JSON-token-stream to JSON-tree (#11518) @karthikeyann
Add gdb pretty-printers for simple types (#11499) @upsj
Add create_random_column function to the data generator (#11490) @vuule
Add fluent API builder to data_profile (#11479) @vuule
Adds Nested Json benchmark (#11466) @karthikeyann
Convert thrust::optional usages to std::optional (#11455) @robertmaynard
Python API for the future experimental JSON reader (#11426) @vuule
Return schema info from JSON reader (#11419) @vuule
Add regex ASCII flag support for matching builtin character classes (#11404) @davidwendt
Truncate parquet column indexes (#11403) @etseidl
Adds the end-to-end JSON parser implementation (#11388) @elstehle
Use the new JSON parser when the experimental reader is selected (#11364) @vuule
Add placeholder for the experimental JSON reader (#11334) @vuule
Add read-only functions on string dtypes to DataFrame.apply and Series.apply (#11319) @brandon-b-miller
Added 'crosstab' and 'pivot_table' features (#11314) @shaswat-indian
Quickly error out when trying to build with unsupported nvcc versions (#11297) @robertmaynard
Adds JSON tokenizer (#11264) @elstehle
List lexicographic comparator (#11129) @devavret
Add generic type inference for cuIO (#11121) @PointKernel
Fully support nested types in cudf::contains (#10656) @ttnghia
Support nested types in lists::contains (#10548) @ttnghia

🛠️ Improvements

Pin dask and distributed for release (#11822) @galipremsagar
Add examples for Nested JSON reader (#11814) @GregoryKimball
Support shuffle-based groupby aggregations in dask_cudf (#11800) @rjzamora
Update strings udf version updater script (#11772) @galipremsagar
Remove kwargs in read_csv & to_csv (#11762) @galipremsagar
Pass dtype param to avoid pd.Series warnings (#11761) @galipremsagar
Enable schema_element & keep_quotes support in json reader (#11746) @galipremsagar
Add ability to construct ListColumn when size is None (#11745) @galipremsagar
Reduces memory requirements in JSON parser and adds bytes/s and peak memory usage to benchmarks (#11732) @elstehle
Add missing copyright headers. (#11712) @bdice
Fix copyright check issues in pre-commit (#11711) @bdice
Include decimal in supported types for range window order-by columns (#11710) @mythrocks
Disable very large column gtest for contiguous-split (#11706) @davidwendt
Drop split_out=None test from groupby.agg (#11704) @wence-
Use CubinLinker for CUDA Minor Version Compatibility (#11701) @gmarkall
Add regex capture-group parameter to auto convert to non-capture groups (#11695) @davidwendt
Add a __dataframe__ method to the protocol dataframe object (#11692) @rgommers
Special-case multibyte_split for single-byte delimiter (#11681) @upsj
Remove isort exclusions (#11680) @bdice
Refactor CSV reader benchmarks with nvbench (#11678) @PointKernel
Check conda recipe headers with pre-commit (#11669) @bdice
Remove redundant style check for clang-format. (#11668) @bdice
Add support for group_keys in groupby (#11659) @galipremsagar
Fix pandoc pinning. (#11658) @bdice
Revert removal of skip_rows / num_rows options from the Parquet reader. (#11657) @nvdbaranec
Update git metadata (#11647) @bdice
Call set_null_count on a returning column if null-count is known (#11646) @davidwendt
Fix some libcudf detail calls not passing the stream variable (#11642) @davidwendt
Update to mypy 0.971 (#11640) @wence-
Refactor strings strip functor to details header (#11635) @davidwendt
Fix incorrect nullCount in get_json_object (#11633) @trxcllnt
Simplify hostdevice_vector (#11631) @upsj
Refactor parquet writer benchmarks with nvbench (#11623) @PointKernel
Rework contains_scalar to check nulls at runtime (#11622) @davidwendt
Fix incorrect memory resource used in rolling temp columns (#11618) @mythrocks
Upgrade pandas to 1.5 (#11617) @galipremsagar
Move type-dispatcher calls from traits.hpp to traits.cpp (#11616) @davidwendt
Refactor parquet reader benchmarks with nvbench (#11611) @PointKernel
Forward-merge branch-22.08 to branch-22.10 (#11608) @bdice
Use stream in Java API. (#11601) @bdice
Refactors of public/detail APIs, CUDF_FUNC_RANGE, stream handling. (#11600) @bdice
Improve ORC writer benchmark with nvbench (#11598) @PointKernel
Tune multibyte_split kernel (#11587) @upsj
Move split_utils.cuh to strings/detail (#11585) @davidwendt
Fix warnings due to compiler regression with if constexpr (#11581) @ttnghia
Add full 24-bit dictionary support to Parquet writer (#11580) @etseidl
Expose "explicit-comms" option in shuffle-based dask_cudf functions (#11576) @rjzamora
Move cudf::strings::findall_record to cudf::strings::findall (#11575) @davidwendt
Refactor dask_cudf groupby to use apply_concat_apply (#11571) @rjzamora
Add ability to write list(struct) columns as map type in orc writer (#11568) @galipremsagar
Add byte_range to multibyte_split benchmark + NVBench refactor (#11562) @upsj
JNI support for writing binary columns in parquet (#11556) @revans2
Support additional dictionary bit widths in Parquet writer (#11547) @etseidl
Refactor string/numeric conversion utilities (#11545) @davidwendt
Removing unnecessary asserts in parquet tests (#11544) @hyperbolic2346
Clean up ORC reader benchmarks with NVBench (#11543) @PointKernel
Reuse MurmurHash3_32 in Parquet page data. (#11528) @bdice
Add hexadecimal value separators (#11527) @bdice
Deprecate skiprows and num_rows in read_orc (#11522) @galipremsagar
Struct support for NULL_EQUALS binary operation (#11520) @rwlee
Bump hadoop-common from 3.2.3 to 3.2.4 in /java (#11516) @dependabot[bot]
Fix Feather test warning. (#11511) @bdice
copy_range ballot_syncs to have no execution dependency (#11508) @robertmaynard
Upgrade to arrow-9.x (#11507) @galipremsagar
Remove support for skip_rows / num_rows options in the parquet reader. (#11503) @nvdbaranec
Single-pass multibyte_split (#11500) @upsj
Sanitize percentile_approx() output for empty input (#11498) @SrikarVanavasam
Unpin dask and distributed for development (#11492) @galipremsagar
Move SparkMurmurHash3_32 functor. (#11489) @bdice
Refactor group_nunique.cu to use nullate::DYNAMIC for reduce-by-key functor (#11482) @davidwendt
Drop support for skiprows and num_rows in cudf.read_parquet (#11480) @galipremsagar
Add reduction distinct_count benchmark (#11473) @ttnghia
Add groupby nunique aggregation benchmark (#11472) @ttnghia
Disable Arrow S3 support by default. (#11470) @bdice
Add groupby max aggregation benchmark (#11464) @ttnghia
Extract Dremel encoding code from Parquet (#11461) @vyasr
Add missing Thrust #includes. (#11457) @bdice
Make CMake hooks verbose (#11456) @vyasr
Control Parquet page size through Python API (#11454) @etseidl
Add control of Parquet column index creation to python (#11453) @etseidl
Remove unused is_struct trait. (#11450) @bdice
Refactor the Buffer class (#11447) @madsbk
Refactor pad_side and strip_type enums into side_type enum (#11438) @davidwendt
Update to Thrust 1.17.0 (#11437) @bdice
Add in JNI for parsing JSON data and getting the metadata back too. (#11431) @revans2
Convert byte_array_view to use std::byte (#11424) @hyperbolic2346
Deprecate unflatten_nested_columns (#11421) @SrikarVanavasam
Remove HASH_SERIAL_MURMUR3 / serial32BitMurmurHash3 (#11383) @bdice
Add Spark list hashing Java tests (#11379) @bdice
Move cmake to the build section. (#11376) @vyasr
Remove use of CUDA driver API calls from libcudf (#11370) @shwina
Add column constructor from device_uvector&& (#11356) @SrikarVanavasam
Remove unused custreamz thirdparty directory (#11343) @vyasr
Update jni version to 22.10.0-SNAPSHOT (#11338) @pxLi
Enable using upstream jitify2 (#11287) @shwina
Cache cudf.Scalar (#11246) @shwina
Remove deprecated Series.applymap. (#11031) @bdice
Remove deprecated expand parameter from str.findall. (#11030) @bdice

Published by GPUtester over 3 years ago

https://github.com/rapidsai/cudf - v22.10.00

🚨 Breaking Changes

Disable Zstandard decompression on nvCOMP 2.4 and Pascal GPUs (#11856) @vuule
Disable nvCOMP DEFLATE integration (#11811) @vuule
Fix return type of Index.isna & Index.notna (#11769) @galipremsagar
Remove kwargs in read_csv & to_csv (#11762) @galipremsagar
Fix cudf::partition* APIs that do not return offsets for empty output table (#11709) @ttnghia
Fix regex negated classes to not automatically include new-lines (#11644) @davidwendt
Update zfill to match Python output (#11634) @davidwendt
Upgrade pandas to 1.5 (#11617) @galipremsagar
Change default value of ordered to False in CategoricalDtype (#11604) @galipremsagar
Move cudf::strings::findall_record to cudf::strings::findall (#11575) @davidwendt
Adding optional parquet reader schema (#11524) @hyperbolic2346
Deprecate skiprows and num_rows in read_orc (#11522) @galipremsagar
Remove support for skip_rows / num_rows options in the parquet reader. (#11503) @nvdbaranec
Drop support for skiprows and num_rows in cudf.read_parquet (#11480) @galipremsagar
Disable Arrow S3 support by default. (#11470) @bdice
Convert thrust::optional usages to std::optional (#11455) @robertmaynard
Remove unused is_struct trait. (#11450) @bdice
Refactor the Buffer class (#11447) @madsbk
Return empty dataframe when reading an ORC file using empty columns option (#11446) @vuule
Refactor pad_side and strip_type enums into side_type enum (#11438) @davidwendt
Remove HASH_SERIAL_MURMUR3 / serial32BitMurmurHash3 (#11383) @bdice
Use the new JSON parser when the experimental reader is selected (#11364) @vuule
Remove deprecated Series.applymap. (#11031) @bdice
Remove deprecated expand parameter from str.findall. (#11030) @bdice

🐛 Bug Fixes

Fixes bug in temporary decompression space estimation before calling nvcomp (#11879) @abellina
Handle ptx file paths during strings_udf import (#11862) @galipremsagar
Disable Zstandard decompression on nvCOMP 2.4 and Pascal GPUs (#11856) @vuule
Reset strings_udf CEC and solve several related issues (#11846) @brandon-b-miller
Fix bug in new shuffle-based groupby implementation (#11836) @rjzamora
Fix is_valid checks in Scalar._binaryop (#11818) @wence-
Fix operator NotImplemented issue with numpy (#11816) @galipremsagar
Disable nvCOMP DEFLATE integration (#11811) @vuule
Build strings_udf package with other python packages in nightlies (#11808) @brandon-b-miller
Revert problematic shuffle=explicit-comms changes (#11803) @rjzamora
Fix regex out-of-bounds write in strided rows logic (#11797) @davidwendt
Build cudf locally before building strings_udf conda packages in CI (#11785) @brandon-b-miller
Fix an issue in cudf::row_bit_count involving structs and lists at multiple levels. (#11779) @nvdbaranec
Fix return type of Index.isna & Index.notna (#11769) @galipremsagar
Fix issue with set-item in case of list and struct types (#11760) @galipremsagar
Ensure all libcudf APIs run on cudf's default stream (#11759) @vyasr
Resolve dask_cudf failures caused by upstream groupby changes (#11755) @rjzamora
Fix ORC string sum statistics (#11740) @vuule
Add strings_udf package for python 3.9 (#11730) @brandon-b-miller
Ensure that all tests launch kernels on cudf's default stream (#11726) @vyasr
Don't assume stream is a compile-time constant expression (#11725) @vyasr
Fix get_thrust.cmake format at patch command (#11715) @davidwendt
Fix cudf::partition* APIs that do not return offsets for empty output table (#11709) @ttnghia
Fix cudf::lists::sort_lists for NaN and Infinity values (#11703) @davidwendt
Modify ORC reader timestamp parsing to match the apache reader behavior (#11699) @vuule
Fix DataFrame.from_arrow to preserve type metadata (#11698) @galipremsagar
Fix compile error due to missing header (#11697) @ttnghia
Default to Snappy compression in to_orc when using cuDF or Dask (#11690) @vuule
Fix an issue related to MultiIndex when group_keys=True (#11689) @galipremsagar
Transfer correct dtype to exploded column (#11687) @wence-
Ignore protobuf generated files in mypy checks (#11685) @galipremsagar
Maintain the index name after .loc (#11677) @shwina
Fix issue with extracting nested column data & dtype preservation (#11671) @galipremsagar
Ensure that all cudf tests and benchmarks are conda env aware (#11666) @robertmaynard
Update to Thrust 1.17.2 to fix cub ODR issues (#11665) @robertmaynard
Fix multi-file remote datasource bug (#11655) @rjzamora
Fix invalid regex quantifier check to not include alternation (#11654) @davidwendt
Fix bug in device_write(): it uses an incorrect size (#11651) @madsbk
fixes overflows in benchmarks (#11649) @elstehle
Fix regex negated classes to not automatically include new-lines (#11644) @davidwendt
Fix compile error in benchmark nested_json.cpp (#11637) @davidwendt
Update zfill to match Python output (#11634) @davidwendt
Removed converted type for INT32 and INT64 since they do not convert (#11627) @hyperbolic2346
Fix host scalars construction of nested types (#11612) @galipremsagar
Fix compile warning in nested_json_gpu.cu (#11607) @davidwendt
Change default value of ordered to False in CategoricalDtype (#11604) @galipremsagar
Preserve order if necessary when deduping categoricals internally (#11597) @brandon-b-miller
Add is_timestamp test for leap second (60) (#11594) @davidwendt
Fix an issue with to_arrow when column name type is not a string (#11590) @galipremsagar
Fix exception in segmented-reduce benchmark (#11588) @davidwendt
Fix encode/decode of negative timestamps in ORC reader/writer (#11586) @vuule
Correct distribution data type in quantiles benchmark (#11584) @vuule
Fix multibyte_split benchmark for host buffers (#11583) @upsj
xfail custreamz display test for now (#11567) @shwina
Fix JNI for TableWithMeta to use schema_info instead of column_names (#11566) @jlowe
Reduce code duplication for dask & distributed nightly/stable installs (#11565) @galipremsagar
Fix groupby failures in dask_cudf CI (#11561) @rjzamora
Fix for pivot: error when 'values' is a multicharacter string (#11538) @shaswat-indian
find_package(cudf) + arrow9 usable with cudf build directory (#11535) @robertmaynard
Fixing crash when writing binary nested data in parquet (#11526) @hyperbolic2346
Fix for: error when assigning a value to an empty series (#11523) @shaswat-indian
Fix invalid results from conditional-left-anti-join in debug build (#11517) @davidwendt
Fix cmake error after upgrading to Arrow 9 (#11513) @ttnghia
Fix reverse binary operators acting on a host value and cudf.Scalar (#11512) @bdice
Update parquet fuzz tests to drop support for skiprows & num_rows (#11505) @galipremsagar
Use rapids-cmake 22.10 best practice for RAPIDS.cmake location (#11493) @robertmaynard
Handle some zero-sized corner cases in dlpack interop (#11449) @wence-
Return empty dataframe when reading an ORC file using empty columns option (#11446) @vuule
libcudf c++ example updated to CPM version 0.35.3 (#11417) @robertmaynard
Fix regex quantifier check to include capture groups (#11373) @davidwendt
Fix read_text when byte_range is aligned with field (#11371) @upsj
Fix to_timestamps truncated subsecond calculation (#11367) @davidwendt
column: calculate null_count before release()ing the cudf::column (#11365) @wence-

📖 Documentation

Update guide-to-udfs notebook (#11861) @brandon-b-miller
Update docstring for cudf.read_text (#11799) @GregoryKimball
Add doc section for list & struct handling (#11770) @galipremsagar
Document that minimum required CMake version is now 3.23.1 (#11751) @robertmaynard
Update libcudf documentation build command in DOCUMENTATION.md (#11735) @davidwendt
Add docs for use of string data to DataFrame.apply and Series.apply and update guide to UDFs notebook (#11733) @brandon-b-miller
Enable more Pydocstyle rules (#11582) @bdice
Remove unused cpp/img folder (#11554) @davidwendt
Publish C++ developer docs (#11475) @vyasr
Fix a misalignment in cudf.get_dummies docstring (#11443) @galipremsagar
Update contributing doc to include links to the developer guides (#11390) @davidwendt
Fix table_view_base doxygen format (#11340) @davidwendt
Create main developer guide for Python (#11235) @vyasr
Add developer documentation for benchmarking (#11122) @vyasr
cuDF error handling document (#7917) @isVoid

🚀 New Features

Add hasNull statistic reading ability to ORC (#11747) @devavret
Add istitle to string UDFs (#11738) @brandon-b-miller
JSON Column creation in GPU (#11714) @karthikeyann
Adds option to take explicit nested schema for nested JSON reader (#11682) @elstehle
Add BGZIP data_chunk_reader (#11652) @upsj
Support DECIMAL order-by for RANGE window functions (#11645) @mythrocks
changing version of cmake to 3.23.3 (#11619) @hyperbolic2346
Generate unique keys table in java JNI contiguousSplitGroups (#11614) @res-life
Generic type casting to support the new nested JSON reader (#11613) @elstehle
JSON tree traversal (#11610) @karthikeyann
Add casting operators to masked UDFs (#11578) @brandon-b-miller
Adds type inference and type conversion for leaf-columns to the nested JSON parser (#11574) @elstehle
Add strings 'like' function (#11558) @davidwendt
Handle hyphen as literal for regex cclass when incomplete range (#11557) @davidwendt
Enable ZSTD compression in ORC and Parquet writers (#11551) @vuule
Adds support for json lines format to the nested JSON reader (#11534) @elstehle
Adding optional parquet reader schema (#11524) @hyperbolic2346
Adds GPU implementation of JSON-token-stream to JSON-tree (#11518) @karthikeyann
Add gdb pretty-printers for simple types (#11499) @upsj
Add create_random_column function to the data generator (#11490) @vuule
Add fluent API builder to data_profile (#11479) @vuule
Adds Nested Json benchmark (#11466) @karthikeyann
Convert thrust::optional usages to std::optional (#11455) @robertmaynard
Python API for the future experimental JSON reader (#11426) @vuule
Return schema info from JSON reader (#11419) @vuule
Add regex ASCII flag support for matching builtin character classes (#11404) @davidwendt
Truncate parquet column indexes (#11403) @etseidl
Adds the end-to-end JSON parser implementation (#11388) @elstehle
Use the new JSON parser when the experimental reader is selected (#11364) @vuule
Add placeholder for the experimental JSON reader (#11334) @vuule
Add read-only functions on string dtypes to DataFrame.apply and Series.apply (#11319) @brandon-b-miller
Added 'crosstab' and 'pivot_table' features (#11314) @shaswat-indian
Quickly error out when trying to build with unsupported nvcc versions (#11297) @robertmaynard
Adds JSON tokenizer (#11264) @elstehle
List lexicographic comparator (#11129) @devavret
Add generic type inference for cuIO (#11121) @PointKernel
Fully support nested types in cudf::contains (#10656) @ttnghia
Support nested types in lists::contains (#10548) @ttnghia

🛠️ Improvements

Pin dask and distributed for release (#11822) @galipremsagar
Add examples for Nested JSON reader (#11814) @GregoryKimball
Support shuffle-based groupby aggregations in dask_cudf (#11800) @rjzamora
Update strings udf version updater script (#11772) @galipremsagar
Remove kwargs in read_csv & to_csv (#11762) @galipremsagar
Pass dtype param to avoid pd.Series warnings (#11761) @galipremsagar
Enable schema_element & keep_quotes support in json reader (#11746) @galipremsagar
Add ability to construct ListColumn when size is None (#11745) @galipremsagar
Reduces memory requirements in JSON parser and adds bytes/s and peak memory usage to benchmarks (#11732) @elstehle
Add missing copyright headers. (#11712) @bdice
Fix copyright check issues in pre-commit (#11711) @bdice
Include decimal in supported types for range window order-by columns (#11710) @mythrocks
Disable very large column gtest for contiguous-split (#11706) @davidwendt
Drop split_out=None test from groupby.agg (#11704) @wence-
Use CubinLinker for CUDA Minor Version Compatibility (#11701) @gmarkall
Add regex capture-group parameter to auto convert to non-capture groups (#11695) @davidwendt
Add a __dataframe__ method to the protocol dataframe object (#11692) @rgommers
Special-case multibyte_split for single-byte delimiter (#11681) @upsj
Remove isort exclusions (#11680) @bdice
Refactor CSV reader benchmarks with nvbench (#11678) @PointKernel
Check conda recipe headers with pre-commit (#11669) @bdice
Remove redundant style check for clang-format. (#11668) @bdice
Add support for group_keys in groupby (#11659) @galipremsagar
Fix pandoc pinning. (#11658) @bdice
Revert removal of skip_rows / num_rows options from the Parquet reader. (#11657) @nvdbaranec
Update git metadata (#11647) @bdice
Call set_null_count on a returning column if null-count is known (#11646) @davidwendt
Fix some libcudf detail calls not passing the stream variable (#11642) @davidwendt
Update to mypy 0.971 (#11640) @wence-
Refactor strings strip functor to details header (#11635) @davidwendt
Fix incorrect nullCount in get_json_object (#11633) @trxcllnt
Simplify hostdevice_vector (#11631) @upsj
Refactor parquet writer benchmarks with nvbench (#11623) @PointKernel
Rework contains_scalar to check nulls at runtime (#11622) @davidwendt
Fix incorrect memory resource used in rolling temp columns (#11618) @mythrocks
Upgrade pandas to 1.5 (#11617) @galipremsagar
Move type-dispatcher calls from traits.hpp to traits.cpp (#11616) @davidwendt
Refactor parquet reader benchmarks with nvbench (#11611) @PointKernel
Forward-merge branch-22.08 to branch-22.10 (#11608) @bdice
Use stream in Java API. (#11601) @bdice
Refactors of public/detail APIs, CUDF_FUNC_RANGE, stream handling. (#11600) @bdice
Improve ORC writer benchmark with nvbench (#11598) @PointKernel
Tune multibyte_split kernel (#11587) @upsj
Move split_utils.cuh to strings/detail (#11585) @davidwendt
Fix warnings due to compiler regression with if constexpr (#11581) @ttnghia
Add full 24-bit dictionary support to Parquet writer (#11580) @etseidl
Expose "explicit-comms" option in shuffle-based dask_cudf functions (#11576) @rjzamora
Move cudf::strings::findall_record to cudf::strings::findall (#11575) @davidwendt
Refactor dask_cudf groupby to use apply_concat_apply (#11571) @rjzamora
Add ability to write list(struct) columns as map type in orc writer (#11568) @galipremsagar
Add byte_range to multibyte_split benchmark + NVBench refactor (#11562) @upsj
JNI support for writing binary columns in parquet (#11556) @revans2
Support additional dictionary bit widths in Parquet writer (#11547) @etseidl
Refactor string/numeric conversion utilities (#11545) @davidwendt
Removing unnecessary asserts in parquet tests (#11544) @hyperbolic2346
Clean up ORC reader benchmarks with NVBench (#11543) @PointKernel
Reuse MurmurHash3_32 in Parquet page data. (#11528) @bdice
Add hexadecimal value separators (#11527) @bdice
Deprecate skiprows and num_rows in read_orc (#11522) @galipremsagar
Struct support for NULL_EQUALS binary operation (#11520) @rwlee
Bump hadoop-common from 3.2.3 to 3.2.4 in /java (#11516) @dependabot[bot]
Fix Feather test warning. (#11511) @bdice
copy_range ballot_syncs to have no execution dependency (#11508) @robertmaynard
Upgrade to arrow-9.x (#11507) @galipremsagar
Remove support for skip_rows / num_rows options in the parquet reader. (#11503) @nvdbaranec
Single-pass multibyte_split (#11500) @upsj
Sanitize percentile_approx() output for empty input (#11498) @SrikarVanavasam
Unpin dask and distributed for development (#11492) @galipremsagar
Move SparkMurmurHash3_32 functor. (#11489) @bdice
Refactor group_nunique.cu to use nullate::DYNAMIC for reduce-by-key functor (#11482) @davidwendt
Drop support for skiprows and num_rows in cudf.read_parquet (#11480) @galipremsagar
Add reduction distinct_count benchmark (#11473) @ttnghia
Add groupby nunique aggregation benchmark (#11472) @ttnghia
Disable Arrow S3 support by default. (#11470) @bdice
Add groupby max aggregation benchmark (#11464) @ttnghia
Extract Dremel encoding code from Parquet (#11461) @vyasr
Add missing Thrust #includes. (#11457) @bdice
Make CMake hooks verbose (#11456) @vyasr
Control Parquet page size through Python API (#11454) @etseidl
Add control of Parquet column index creation to python (#11453) @etseidl
Remove unused is_struct trait. (#11450) @bdice
Refactor the Buffer class (#11447) @madsbk
Refactor pad_side and strip_type enums into side_type enum (#11438) @davidwendt
Update to Thrust 1.17.0 (#11437) @bdice
Add in JNI for parsing JSON data and getting the metadata back too. (#11431) @revans2
Convert byte_array_view to use std::byte (#11424) @hyperbolic2346
Deprecate unflatten_nested_columns (#11421) @SrikarVanavasam
Remove HASH_SERIAL_MURMUR3 / serial32BitMurmurHash3 (#11383) @bdice
Add Spark list hashing Java tests (#11379) @bdice
Move cmake to the build section. (#11376) @vyasr
Remove use of CUDA driver API calls from libcudf (#11370) @shwina
Add column constructor from device_uvector&& (#11356) @SrikarVanavasam
Remove unused custreamz thirdparty directory (#11343) @vyasr
Update jni version to 22.10.0-SNAPSHOT (#11338) @pxLi
Enable using upstream jitify2 (#11287) @shwina
Cache cudf.Scalar (#11246) @shwina
Remove deprecated Series.applymap. (#11031) @bdice
Remove deprecated expand parameter from str.findall. (#11030) @bdice

- C++
Published by GPUtester over 3 years ago

https://github.com/rapidsai/cudf - v22.08.01

🚨 Breaking Changes

Pin numpy to <1.23 (#11824) @galipremsagar
Remove legacy join APIs (#11274) @vyasr
Remove lists::drop_list_duplicates (#11236) @ttnghia
Remove Index.replace API (#11131) @vyasr
Remove deprecated Index methods from Frame (#11073) @vyasr
Remove public API of cudf.merge_sorted. (#11032) @bdice
Drop python 3.7 in code-base (#11029) @galipremsagar
Return empty dataframe when reading a Parquet file using empty columns option (#11018) @vuule
Remove Arrow CUDA IPC code (#10995) @shwina
Buffer: make .ptr read-only (#10872) @madsbk

🐛 Bug Fixes

Fix out-of-bound access in cudf::detail::label_segments (#11497) @ttnghia
Fix distributed error related to loop_in_thread (#11428) @galipremsagar
Fix atomic operations on NaN values (#11420) @ttnghia
Relax arrow pinning to just 8.x and remove cuda build dependency from cudf recipe (#11412) @kkraus14
Revert "Allow CuPy 11" (#11409) @jakirkham
Fix moto timeouts (#11369) @galipremsagar
Set +/-infinity as the identity values for floating-point numbers in device operators min and max (#11357) @ttnghia
Fix memory_usage() for ListSeries (#11355) @thomcom
Fix constructing Column from column_view with expired mask (#11354) @shwina
Handle parquet corner case: Columns with more rows than are in the row group. (#11353) @nvdbaranec
Fix DatetimeIndex & TimedeltaIndex constructors (#11342) @galipremsagar
Fix unsigned-compare compile warning in IntPow binops (#11339) @davidwendt
Fix performance issue and add a new code path to cudf::detail::contains (#11330) @ttnghia
Pin pytorch to temporarily unblock from libcupti errors (#11289) @galipremsagar
Workaround for nvcomp zstd overwriting blocks for orc due to underestimate of sizes (#11288) @jbrennan333
Fix inconsistency when hashing two tables in cudf::detail::contains (#11284) @ttnghia
Fix issue related to numpy array and category dtype (#11282) @galipremsagar
Add NotImplementedError when on is specified in DataFrame.join. (#11275) @vyasr
Fix invalid allocate_like() and empty_like() tests. (#11268) @nvdbaranec
Returns DataFrame When Concating Along Axis 1 (#11263) @isVoid
Fix compile error due to missing header (#11257) @ttnghia
Fix a memory aliasing/crash issue in scatter for lists. (#11254) @nvdbaranec
Fix tests/rolling/empty_input_test (#11238) @ttnghia
Fix const qualifier when using host_span<bitmask_type const*> (#11220) @ttnghia
Avoid using nvcompBatchedDeflateDecompressGetTempSizeEx in cuIO (#11213) @vuule
Generate benchmark data with correct run length regardless of cardinality (#11205) @vuule
Fix cumulative count index behavior (#11188) @brandon-b-miller
Fix assertion in dask_cudf test_struct_explode (#11170) @rjzamora
Provides a method for the user to remove the hook and re-register the hook in a custom shutdown hook manager (#11161) @res-life
Fix compatibility issues with pandas 1.4.3 (#11152) @vyasr
Ensure cuco export set is installed in cmake build (#11147) @jlowe
Avoid redundant deepcopy in cudf.from_pandas (#11142) @galipremsagar
Fix compile error due to missing header (#11126) @ttnghia
Fix __cuda_array_interface__ failures (#11113) @galipremsagar
Support octal and hex within regex character class pattern (#11112) @davidwendt
Fix split_re matching logic for word boundaries (#11106) @davidwendt
Handle multiple files metadata in read_parquet (#11105) @galipremsagar
Fix index alignment for Series objects with repeated index (#11103) @shwina
FindcuFile now searches in the current CUDA Toolkit location (#11101) @robertmaynard
Fix regex word boundary logic to include underline (#11099) @davidwendt
Exclude CudaFatalTest when selecting all Java tests (#11083) @jlowe
Fix duplicate cudatoolkit pinning issue (#11070) @galipremsagar
Maintain the input index in the result of a groupby-transform (#11068) @shwina
Fix bug with row count comparison for expect_columns_equivalent(). (#11059) @nvdbaranec
Fix BPE uninitialized size value for null and empty input strings (#11054) @davidwendt
Include missing header for usage of get_current_device_resource() (#11047) @AtlantaPepsi
Fix warn_unused_result error in parquet test (#11026) @karthikeyann
Return empty dataframe when reading a Parquet file using empty columns option (#11018) @vuule
Fix small error in page row count limiting (#10991) @etseidl
Fix a row index entry error in ORC writer issue (#10989) @vuule
Fix grouped covariance to require both values to be convertible to double. (#10891) @bdice

📖 Documentation

Defer loading of custom.js (#11465) @galipremsagar
Fix issues with day & night modes in python docs (#11400) @galipremsagar
Update missing data handling APIs in docs (#11345) @galipremsagar
Add lists filtering APIs to doxygen group. (#11336) @bdice
Remove unused import in README sample (#11318) @vyasr
Note null behavior in where docs (#11276) @brandon-b-miller
Update docstring for spans in get_row_data_range (#11271) @vyasr
Update nvCOMP integration table (#11231) @vuule
Add dev docs for documentation writing (#11217) @vyasr
Documentation fix for concatenate (#11187) @dagardner-nv
Fix unresolved links in markdown (#11173) @karthikeyann
Fix cudf version in README.md install commands (#11164) @jvanstraten
Switch language from None to "en" in docs build (#11133) @galipremsagar
Remove docs mentioning scalar_view since no such class exists. (#11132) @bdice
Add docstring entry for DataFrame.value_counts (#11039) @galipremsagar
Add docs to rolling var, std, count. (#11035) @bdice
Fix docs for Numba UDFs. (#11020) @bdice
Replace column comparison utilities functions with macros (#11007) @karthikeyann
Fix Doxygen warnings in multiple headers files (#11003) @karthikeyann
Fix doxygen warnings in utilities/ headers (#10974) @karthikeyann
Fix Doxygen warnings in table header files (#10964) @karthikeyann
Fix Doxygen warnings in column header files (#10963) @karthikeyann
Fix Doxygen warnings in strings / header files (#10937) @karthikeyann
Generate Doxygen Tag File for Libcudf (#10932) @isVoid
Fix doxygen warnings in structs, lists headers (#10923) @karthikeyann
Fix doxygen warnings in fixed_point.hpp (#10922) @karthikeyann
Fix doxygen warnings in ast/, rolling, tdigest/, wrappers/, dictionary/ headers (#10921) @karthikeyann
fix doxygen warnings in cudf/io/types.hpp, other header files (#10913) @karthikeyann
fix doxygen warnings in cudf/io/ avro, csv, json, orc, parquet header files (#10912) @karthikeyann
Fix doxygen warnings in cudf/*.hpp (#10896) @karthikeyann
Add missing documentation in aggregation.hpp (#10887) @karthikeyann
Revise PR template. (#10774) @bdice

🚀 New Features

Change cmake to allow controlling Arrow version via cmake variable (#11429) @kkraus14
Adding support for list<int8> columns to be written as byte arrays in parquet (#11328) @hyperbolic2346
Adding byte array view structure (#11322) @hyperbolic2346
Adding byte_array statistics (#11303) @hyperbolic2346
Add column indexes to Parquet writer (#11302) @etseidl
Provide an Option for Default Integer and Floating Bitwidth (#11272) @isVoid
FST benchmark (#11243) @karthikeyann
Adds the Finite-State Transducer algorithm (#11242) @elstehle
Refactor collect_set to use cudf::distinct and cudf::lists::distinct (#11228) @ttnghia
Treat zstd as stable in nvcomp releases 2.3.2 and later (#11226) @jbrennan333
Add 24 bit dictionary support to Parquet writer (#11216) @devavret
Enable positive group indices for extractAllRecord on JNI (#11215) @anthony-chang
JNI bindings for NTH_ELEMENT window aggregation (#11201) @mythrocks
Add JNI bindings for extractAllRecord (#11196) @anthony-chang
Add cudf.options (#11193) @isVoid
Add thrift support for parquet column and offset indexes (#11178) @etseidl
Adding binary read/write as options for parquet (#11160) @hyperbolic2346
Support nth_element for window functions (#11158) @mythrocks
Implement lists::distinct and cudf::detail::stable_distinct (#11149) @ttnghia
Implement Groupby pct_change (#11144) @skirui-source
Add JNI for set operations (#11143) @ttnghia
Remove deprecated PER_THREAD_DEFAULT_STREAM (#11134) @jbrennan333
Added a Java method to check the existence of a list of keys in a map (#11128) @razajafri
Feature/python benchmarking (#11125) @vyasr
Support nan_equality in cudf::distinct (#11118) @ttnghia
Added JNI for getMapValueForKeys (#11104) @razajafri
Refactor semi_anti_join (#11100) @ttnghia
Replace remaining instances of rmm::cuda_stream_default with cudf::default_stream_value (#11082) @jbrennan333
Adds the Logical Stack algorithm (#11078) @elstehle
Add doxygen-check pre-commit hook (#11076) @karthikeyann
Use new nvCOMP API to optimize the decompression temp memory size (#11064) @vuule
Add Doxygen CI check (#11057) @karthikeyann
Support duplicate_keep_option in cudf::distinct (#11052) @ttnghia
Support set operations (#11043) @ttnghia
Support for ZLIB compression in ORC writer (#11036) @vuule
Adding feature swaplevels (#11027) @VamsiTallam95
Use nvCOMP for ZLIB decompression in ORC reader (#11024) @vuule
Function for bfill, ffill #9591 (#11022) @Sreekiran096
Generate group offsets from element labels (#11017) @ttnghia
Feature axes (#10979) @VamsiTallam95
Generate group labels from offsets (#10945) @ttnghia
Add missing cuIO benchmark coverage for duration types (#10933) @vuule
Dask-cuDF cumulative groupby ops (#10889) @brandon-b-miller
Reindex Improvements (#10815) @brandon-b-miller
Implement value_counts for DataFrame (#10813) @martinfalisse

🛠️ Improvements

Pin numpy to <1.23 (#11824) @galipremsagar
Make Index Join Tests on Default Precisions Deterministic (#11451) @isVoid
Pin dask & distributed for release (#11433) @galipremsagar
Use documented header template for doxygen (#11430) @galipremsagar
Relax arrow version in dev env (#11418) @galipremsagar
Added Java bindings for Parquet options for binary read (#11410) @razajafri
Allow CuPy 11 (#11393) @jakirkham
Improve multibyte_split performance (#11347) @cwharris
Switch death test to use explicit trap. (#11326) @vyasr
Add --output-on-failure to ctest args. (#11321) @vyasr
Consolidate remaining DataFrame/Series APIs (#11315) @vyasr
Add JNI support for the join_strings API (#11309) @revans2
Add cupy version to setup.py install_requires (#11306) @vyasr
removing some unused code (#11305) @hyperbolic2346
Add test of wildcard selection (#11300) @vyasr
Update parquet reader to take stream parameter (#11294) @PointKernel
Spark list hashing (#11292) @bdice
Remove legacy join APIs (#11274) @vyasr
Fix cudf recipes syntax (#11273) @ajschmidt8
Fix cudf recipe (#11267) @ajschmidt8
Cleanup config files (#11266) @vyasr
Run mypy on all packages (#11265) @vyasr
Update to isort 5.10.1. (#11262) @vyasr
Consolidate flake8 and pydocstyle configuration (#11260) @vyasr
Remove redundant black config specifications. (#11258) @vyasr
Ensure DeprecationWarnings are not introduced via pre-commit (#11255) @wence-
Optimization to gpu::PreprocessColumnData in parquet reader. (#11252) @nvdbaranec
Move rolling impl details to detail/ directory. (#11250) @mythrocks
Remove lists::drop_list_duplicates (#11236) @ttnghia
Use cudf::lists::distinct in Python binding (#11234) @ttnghia
Use cudf::lists::distinct in Java binding (#11233) @ttnghia
Use cudf::distinct in Java binding (#11232) @ttnghia
Pin dask-cuda in dev environment (#11229) @galipremsagar
Remove cruft in map_lookup (#11221) @mythrocks
Deprecate skiprows & num_rows in parquet reader (#11218) @galipremsagar
Remove Frame._index (#11210) @vyasr
Improve performance for cudf::contains when searching for a scalar (#11202) @ttnghia
Document why Development component is needing for CMake. (#11200) @vyasr
cleanup unused code in rolling_test.hpp (#11195) @karthikeyann
Standardize join internals around DataFrame (#11184) @vyasr
Move character case table declarations from src to detail (#11183) @davidwendt
Remove usage of Frame in StringMethods (#11181) @vyasr
Expose get_json_object_options to Python (#11180) @SrikarVanavasam
Fix decimal128 stats in parquet writer (#11179) @etseidl
Modify CheckPageRows in parquet_test to use datasources (#11177) @etseidl
Pin max version of cuda-python to 11.7.0 (#11174) @Ethyling
Refactor and optimize Frame.where (#11168) @vyasr
Add npos const static member to cudf::string_view (#11166) @davidwendt
Move _drop_rows_by_label from Frame to IndexedFrame (#11157) @vyasr
Clean up _copy_type_metadata (#11156) @vyasr
Add nvcc conda package in dev environment (#11154) @galipremsagar
Struct binary comparison op functionality for spark rapids (#11153) @rwlee
Refactor inline conditionals. (#11151) @bdice
Refactor Spark hashing tests (#11145) @bdice
Add new _from_data_like_self factory (#11140) @vyasr
Update get_cucollections to use rapids-cmake (#11139) @vyasr
Remove unnecessary extra function for libcudacxx detection (#11138) @vyasr
Allow initial value for cudf::reduce and cudf::segmented_reduce. (#11137) @SrikarVanavasam
Remove Index.replace API (#11131) @vyasr
Move char-type table function declarations from src to detail (#11127) @davidwendt
Clean up repo root (#11124) @bdice
Improve print formatting of strings containing newline characters. (#11108) @nvdbaranec
Fix cudf::string_view::find() to return pos for empty string argument (#11107) @davidwendt
Forward-merge branch-22.06 to branch-22.08 (#11086) @bdice
Take iterators by value in clamp.cu. (#11084) @bdice
Performance improvements for row to column conversions (#11075) @hyperbolic2346
Remove deprecated Index methods from Frame (#11073) @vyasr
Use per-page max compressed size estimate for compression (#11066) @devavret
column to row refactor for performance (#11063) @hyperbolic2346
Include skbuild directory into build.sh clean operation (#11060) @galipremsagar
Unpin dask & distributed for development (#11058) @galipremsagar
Add support for Series.between (#11051) @galipremsagar
Fix groupby include (#11046) @bwyogatama
Regex cleanup internal reclass and reclass_device classes (#11045) @davidwendt
Remove public API of cudf.merge_sorted. (#11032) @bdice
Drop python 3.7 in code-base (#11029) @galipremsagar
Addition & integration of the integer power operator (#11025) @AtlantaPepsi
Refactor lists::contains (#11019) @ttnghia
Change build.sh to find C++ library by default and avoid shadowing CMAKE_ARGS (#11013) @vyasr
Clean up parquet unit test (#11005) @PointKernel
Add missing #pragma once to header files (#11004) @karthikeyann
Cleanup iterator.cuh and add fixed point support for scalar_optional_accessor (#10999) @ttnghia
Refactor cudf::contains (#10997) @ttnghia
Remove Arrow CUDA IPC code (#10995) @shwina
Change file extension for groupby benchmark (#10985) @ttnghia
Sort recipe include checks. (#10984) @bdice
Update cuCollections for thrust upgrade (#10983) @PointKernel
Expose row-group size options in cudf ParquetWriter (#10980) @rjzamora
Cleanup cudf::strings::detail::regex_parser class source (#10975) @davidwendt
Handle missing fields as nulls in get_json_object() (#10970) @SrikarVanavasam
Fix license families to match all-caps expected by conda-verify. (#10931) @bdice
Include <optional> for GCC 11 compatibility. (#10927) @bdice
Enable builds with scikit-build (#10919) @vyasr
Improve distinct by using cuco::static_map::retrieve_all (#10916) @PointKernel
update cudfjni to 22.08.0-SNAPSHOT (#10910) @pxLi
Improve the capture of fatal cuda error (#10884) @sperlingxx
Cleanup regex compiler operators and operands source (#10879) @davidwendt
Buffer: make .ptr read-only (#10872) @madsbk
Configurable NaN handling in device_row_comparators (#10870) @rwlee
Register cudf.core.groupby.Grouper objects to dask grouper_dispatch (#10838) @brandon-b-miller
Upgrade to arrow-8 (#10816) @galipremsagar
Remove __getattr__ method in RangeIndex class (#10538) @skirui-source
Adding bins to value counts (#8247) @marlenezw

- C++
Published by GPUtester over 3 years ago

https://github.com/rapidsai/cudf - v22.08.00

🚨 Breaking Changes

Remove legacy join APIs (#11274) @vyasr
Remove lists::drop_list_duplicates (#11236) @ttnghia
Remove Index.replace API (#11131) @vyasr
Remove deprecated Index methods from Frame (#11073) @vyasr
Remove public API of cudf.merge_sorted. (#11032) @bdice
Drop python 3.7 in code-base (#11029) @galipremsagar
Return empty dataframe when reading a Parquet file using empty columns option (#11018) @vuule
Remove Arrow CUDA IPC code (#10995) @shwina
Buffer: make .ptr read-only (#10872) @madsbk

🐛 Bug Fixes

Fix distributed error related to loop_in_thread (#11428) @galipremsagar
Relax arrow pinning to just 8.x and remove cuda build dependency from cudf recipe (#11412) @kkraus14
Revert "Allow CuPy 11" (#11409) @jakirkham
Fix moto timeouts (#11369) @galipremsagar
Set +/-infinity as the identity values for floating-point numbers in device operators min and max (#11357) @ttnghia
Fix memory_usage() for ListSeries (#11355) @thomcom
Fix constructing Column from column_view with expired mask (#11354) @shwina
Handle parquet corner case: Columns with more rows than are in the row group. (#11353) @nvdbaranec
Fix DatetimeIndex & TimedeltaIndex constructors (#11342) @galipremsagar
Fix unsigned-compare compile warning in IntPow binops (#11339) @davidwendt
Fix performance issue and add a new code path to cudf::detail::contains (#11330) @ttnghia
Pin pytorch to temporarily unblock from libcupti errors (#11289) @galipremsagar
Workaround for nvcomp zstd overwriting blocks for orc due to underestimate of sizes (#11288) @jbrennan333
Fix inconsistency when hashing two tables in cudf::detail::contains (#11284) @ttnghia
Fix issue related to numpy array and category dtype (#11282) @galipremsagar
Add NotImplementedError when on is specified in DataFrame.join. (#11275) @vyasr
Fix invalid allocate_like() and empty_like() tests. (#11268) @nvdbaranec
Returns DataFrame When Concating Along Axis 1 (#11263) @isVoid
Fix compile error due to missing header (#11257) @ttnghia
Fix a memory aliasing/crash issue in scatter for lists. (#11254) @nvdbaranec
Fix tests/rolling/empty_input_test (#11238) @ttnghia
Fix const qualifier when using host_span<bitmask_type const*> (#11220) @ttnghia
Avoid using nvcompBatchedDeflateDecompressGetTempSizeEx in cuIO (#11213) @vuule
Generate benchmark data with correct run length regardless of cardinality (#11205) @vuule
Fix cumulative count index behavior (#11188) @brandon-b-miller
Fix assertion in dask_cudf test_struct_explode (#11170) @rjzamora
Provides a method for the user to remove the hook and re-register the hook in a custom shutdown hook manager (#11161) @res-life
Fix compatibility issues with pandas 1.4.3 (#11152) @vyasr
Ensure cuco export set is installed in cmake build (#11147) @jlowe
Avoid redundant deepcopy in cudf.from_pandas (#11142) @galipremsagar
Fix compile error due to missing header (#11126) @ttnghia
Fix __cuda_array_interface__ failures (#11113) @galipremsagar
Support octal and hex within regex character class pattern (#11112) @davidwendt
Fix split_re matching logic for word boundaries (#11106) @davidwendt
Handle multiple files metadata in read_parquet (#11105) @galipremsagar
Fix index alignment for Series objects with repeated index (#11103) @shwina
FindcuFile now searches in the current CUDA Toolkit location (#11101) @robertmaynard
Fix regex word boundary logic to include underline (#11099) @davidwendt
Exclude CudaFatalTest when selecting all Java tests (#11083) @jlowe
Fix duplicate cudatoolkit pinning issue (#11070) @galipremsagar
Maintain the input index in the result of a groupby-transform (#11068) @shwina
Fix bug with row count comparison for expect_columns_equivalent(). (#11059) @nvdbaranec
Fix BPE uninitialized size value for null and empty input strings (#11054) @davidwendt
Include missing header for usage of get_current_device_resource() (#11047) @AtlantaPepsi
Fix warn_unused_result error in parquet test (#11026) @karthikeyann
Return empty dataframe when reading a Parquet file using empty columns option (#11018) @vuule
Fix small error in page row count limiting (#10991) @etseidl
Fix a row index entry error in ORC writer issue (#10989) @vuule
Fix grouped covariance to require both values to be convertible to double. (#10891) @bdice

📖 Documentation

Fix issues with day & night modes in python docs (#11400) @galipremsagar
Update missing data handling APIs in docs (#11345) @galipremsagar
Add lists filtering APIs to doxygen group. (#11336) @bdice
Remove unused import in README sample (#11318) @vyasr
Note null behavior in where docs (#11276) @brandon-b-miller
Update docstring for spans in get_row_data_range (#11271) @vyasr
Update nvCOMP integration table (#11231) @vuule
Add dev docs for documentation writing (#11217) @vyasr
Documentation fix for concatenate (#11187) @dagardner-nv
Fix unresolved links in markdown (#11173) @karthikeyann
Fix cudf version in README.md install commands (#11164) @jvanstraten
Switch language from None to "en" in docs build (#11133) @galipremsagar
Remove docs mentioning scalar_view since no such class exists. (#11132) @bdice
Add docstring entry for DataFrame.value_counts (#11039) @galipremsagar
Add docs to rolling var, std, count. (#11035) @bdice
Fix docs for Numba UDFs. (#11020) @bdice
Replace column comparison utilities functions with macros (#11007) @karthikeyann
Fix Doxygen warnings in multiple headers files (#11003) @karthikeyann
Fix doxygen warnings in utilities/ headers (#10974) @karthikeyann
Fix Doxygen warnings in table header files (#10964) @karthikeyann
Fix Doxygen warnings in column header files (#10963) @karthikeyann
Fix Doxygen warnings in strings / header files (#10937) @karthikeyann
Generate Doxygen Tag File for Libcudf (#10932) @isVoid
Fix doxygen warnings in structs, lists headers (#10923) @karthikeyann
Fix doxygen warnings in fixed_point.hpp (#10922) @karthikeyann
Fix doxygen warnings in ast/, rolling, tdigest/, wrappers/, dictionary/ headers (#10921) @karthikeyann
fix doxygen warnings in cudf/io/types.hpp, other header files (#10913) @karthikeyann
fix doxygen warnings in cudf/io/ avro, csv, json, orc, parquet header files (#10912) @karthikeyann
Fix doxygen warnings in cudf/*.hpp (#10896) @karthikeyann
Add missing documentation in aggregation.hpp (#10887) @karthikeyann
Revise PR template. (#10774) @bdice

🚀 New Features

Change cmake to allow controlling Arrow version via cmake variable (#11429) @kkraus14
Adding support for list<int8> columns to be written as byte arrays in parquet (#11328) @hyperbolic2346
Adding byte array view structure (#11322) @hyperbolic2346
Adding byte_array statistics (#11303) @hyperbolic2346
Add column indexes to Parquet writer (#11302) @etseidl
Provide an Option for Default Integer and Floating Bitwidth (#11272) @isVoid
FST benchmark (#11243) @karthikeyann
Adds the Finite-State Transducer algorithm (#11242) @elstehle
Refactor collect_set to use cudf::distinct and cudf::lists::distinct (#11228) @ttnghia
Treat zstd as stable in nvcomp releases 2.3.2 and later (#11226) @jbrennan333
Add 24 bit dictionary support to Parquet writer (#11216) @devavret
Enable positive group indices for extractAllRecord on JNI (#11215) @anthony-chang
JNI bindings for NTH_ELEMENT window aggregation (#11201) @mythrocks
Add JNI bindings for extractAllRecord (#11196) @anthony-chang
Add cudf.options (#11193) @isVoid
Add thrift support for parquet column and offset indexes (#11178) @etseidl
Adding binary read/write as options for parquet (#11160) @hyperbolic2346
Support nth_element for window functions (#11158) @mythrocks
Implement lists::distinct and cudf::detail::stable_distinct (#11149) @ttnghia
Implement Groupby pct_change (#11144) @skirui-source
Add JNI for set operations (#11143) @ttnghia
Remove deprecated PER_THREAD_DEFAULT_STREAM (#11134) @jbrennan333
Added a Java method to check the existence of a list of keys in a map (#11128) @razajafri
Feature/python benchmarking (#11125) @vyasr
Support nan_equality in cudf::distinct (#11118) @ttnghia
Added JNI for getMapValueForKeys (#11104) @razajafri
Refactor semi_anti_join (#11100) @ttnghia
Replace remaining instances of rmm::cuda_stream_default with cudf::default_stream_value (#11082) @jbrennan333
Adds the Logical Stack algorithm (#11078) @elstehle
Add doxygen-check pre-commit hook (#11076) @karthikeyann
Use new nvCOMP API to optimize the decompression temp memory size (#11064) @vuule
Add Doxygen CI check (#11057) @karthikeyann
Support duplicate_keep_option in cudf::distinct (#11052) @ttnghia
Support set operations (#11043) @ttnghia
Support for ZLIB compression in ORC writer (#11036) @vuule
Adding feature swaplevels (#11027) @VamsiTallam95
Use nvCOMP for ZLIB decompression in ORC reader (#11024) @vuule
Function for bfill, ffill #9591 (#11022) @Sreekiran096
Generate group offsets from element labels (#11017) @ttnghia
Feature axes (#10979) @VamsiTallam95
Generate group labels from offsets (#10945) @ttnghia
Add missing cuIO benchmark coverage for duration types (#10933) @vuule
Dask-cuDF cumulative groupby ops (#10889) @brandon-b-miller
Reindex Improvements (#10815) @brandon-b-miller
Implement value_counts for DataFrame (#10813) @martinfalisse

🛠️ Improvements

Pin dask & distributed for release (#11433) @galipremsagar
Use documented header template for doxygen (#11430) @galipremsagar
Relax arrow version in dev env (#11418) @galipremsagar
Allow CuPy 11 (#11393) @jakirkham
Improve multibyte_split performance (#11347) @cwharris
Switch death test to use explicit trap. (#11326) @vyasr
Add --output-on-failure to ctest args. (#11321) @vyasr
Consolidate remaining DataFrame/Series APIs (#11315) @vyasr
Add JNI support for the join_strings API (#11309) @revans2
Add cupy version to setup.py install_requires (#11306) @vyasr
removing some unused code (#11305) @hyperbolic2346
Add test of wildcard selection (#11300) @vyasr
Update parquet reader to take stream parameter (#11294) @PointKernel
Spark list hashing (#11292) @bdice
Remove legacy join APIs (#11274) @vyasr
Fix cudf recipes syntax (#11273) @ajschmidt8
Fix cudf recipe (#11267) @ajschmidt8
Cleanup config files (#11266) @vyasr
Run mypy on all packages (#11265) @vyasr
Update to isort 5.10.1. (#11262) @vyasr
Consolidate flake8 and pydocstyle configuration (#11260) @vyasr
Remove redundant black config specifications. (#11258) @vyasr
Ensure DeprecationWarnings are not introduced via pre-commit (#11255) @wence-
Optimization to gpu::PreprocessColumnData in parquet reader. (#11252) @nvdbaranec
Move rolling impl details to detail/ directory. (#11250) @mythrocks
Remove lists::drop_list_duplicates (#11236) @ttnghia
Use cudf::lists::distinct in Python binding (#11234) @ttnghia
Use cudf::lists::distinct in Java binding (#11233) @ttnghia
Use cudf::distinct in Java binding (#11232) @ttnghia
Pin dask-cuda in dev environment (#11229) @galipremsagar
Remove cruft in map_lookup (#11221) @mythrocks
Deprecate skiprows & num_rows in parquet reader (#11218) @galipremsagar
Remove Frame._index (#11210) @vyasr
Improve performance for cudf::contains when searching for a scalar (#11202) @ttnghia
Document why Development component is needing for CMake. (#11200) @vyasr
cleanup unused code in rolling_test.hpp (#11195) @karthikeyann
Standardize join internals around DataFrame (#11184) @vyasr
Move character case table declarations from src to detail (#11183) @davidwendt
Remove usage of Frame in StringMethods (#11181) @vyasr
Expose get_json_object_options to Python (#11180) @SrikarVanavasam
Fix decimal128 stats in parquet writer (#11179) @etseidl
Modify CheckPageRows in parquet_test to use datasources (#11177) @etseidl
Pin max version of cuda-python to 11.7.0 (#11174) @Ethyling
Refactor and optimize Frame.where (#11168) @vyasr
Add npos const static member to cudf::string_view (#11166) @davidwendt
Move _drop_rows_by_label from Frame to IndexedFrame (#11157) @vyasr
Clean up _copy_type_metadata (#11156) @vyasr
Add nvcc conda package in dev environment (#11154) @galipremsagar
Struct binary comparison op functionality for spark rapids (#11153) @rwlee
Refactor inline conditionals. (#11151) @bdice
Refactor Spark hashing tests (#11145) @bdice
Add new _from_data_like_self factory (#11140) @vyasr
Update get_cucollections to use rapids-cmake (#11139) @vyasr
Remove unnecessary extra function for libcudacxx detection (#11138) @vyasr
Allow initial value for cudf::reduce and cudf::segmented_reduce. (#11137) @SrikarVanavasam
Remove Index.replace API (#11131) @vyasr
Move char-type table function declarations from src to detail (#11127) @davidwendt
Clean up repo root (#11124) @bdice
Improve print formatting of strings containing newline characters. (#11108) @nvdbaranec
Fix cudf::string_view::find() to return pos for empty string argument (#11107) @davidwendt
Forward-merge branch-22.06 to branch-22.08 (#11086) @bdice
Take iterators by value in clamp.cu. (#11084) @bdice
Performance improvements for row to column conversions (#11075) @hyperbolic2346
Remove deprecated Index methods from Frame (#11073) @vyasr
Use per-page max compressed size estimate for compression (#11066) @devavret
column to row refactor for performance (#11063) @hyperbolic2346
Include skbuild directory into build.sh clean operation (#11060) @galipremsagar
Unpin dask & distributed for development (#11058) @galipremsagar
Add support for Series.between (#11051) @galipremsagar
Fix groupby include (#11046) @bwyogatama
Regex cleanup internal reclass and reclass_device classes (#11045) @davidwendt
Remove public API of cudf.merge_sorted. (#11032) @bdice
Drop python 3.7 in code-base (#11029) @galipremsagar
Addition & integration of the integer power operator (#11025) @AtlantaPepsi
Refactor lists::contains (#11019) @ttnghia
Change build.sh to find C++ library by default and avoid shadowing CMAKE_ARGS (#11013) @vyasr
Clean up parquet unit test (#11005) @PointKernel
Add missing #pragma once to header files (#11004) @karthikeyann
Cleanup iterator.cuh and add fixed point support for scalar_optional_accessor (#10999) @ttnghia
Refactor cudf::contains (#10997) @ttnghia
Remove Arrow CUDA IPC code (#10995) @shwina
Change file extension for groupby benchmark (#10985) @ttnghia
Sort recipe include checks. (#10984) @bdice
Update cuCollections for thrust upgrade (#10983) @PointKernel
Expose row-group size options in cudf ParquetWriter (#10980) @rjzamora
Cleanup cudf::strings::detail::regex_parser class source (#10975) @davidwendt
Handle missing fields as nulls in get_json_object() (#10970) @SrikarVanavasam
Fix license families to match all-caps expected by conda-verify. (#10931) @bdice
Include <optional> for GCC 11 compatibility. (#10927) @bdice
Enable builds with scikit-build (#10919) @vyasr
Improve distinct by using cuco::static_map::retrieve_all (#10916) @PointKernel
update cudfjni to 22.08.0-SNAPSHOT (#10910) @pxLi
Improve the capture of fatal cuda error (#10884) @sperlingxx
Cleanup regex compiler operators and operands source (#10879) @davidwendt
Buffer: make .ptr read-only (#10872) @madsbk
Configurable NaN handling in device_row_comparators (#10870) @rwlee
Register cudf.core.groupby.Grouper objects to dask grouper_dispatch (#10838) @brandon-b-miller
Upgrade to arrow-8 (#10816) @galipremsagar
Remove __getattr__ method in RangeIndex class (#10538) @skirui-source
Adding bins to value counts (#8247) @marlenezw

- C++
Published by GPUtester almost 4 years ago

https://github.com/rapidsai/cudf - v22.06.01

v22.06.01

- C++
Published by GPUtester almost 4 years ago

https://github.com/rapidsai/cudf - v22.06.00

🚨 Breaking Changes

Enable Zstandard decompression only when all nvcomp integrations are enabled (#10944) @vuule
Rename sliced_child to get_sliced_child. (#10885) @bdice
Add parameters to control page size in Parquet writer (#10882) @etseidl
Make cudf::test::expect_columns_equal() fail when comparing unsanitary lists. (#10880) @nvdbaranec
Cleanup regex compiler fixed quantifiers source (#10843) @davidwendt
Refactor cudf::contains, renaming and switching parameters role (#10802) @ttnghia
Generic serialization of all column types (#10784) @wence-
Return per-file metadata from readers (#10782) @vuule
HostColumnVectorCore#isNull should return true for out-of-range rows (#10779) @gerashegalov
Update groupby::hash to use new row operators for keys (#10770) @PointKernel
update mangle_dupe_cols behavior in csv reader to match pandas 1.4.0 behavior (#10749) @karthikeyann
Rename CUDA_TRY macro to CUDF_CUDA_TRY, rename CHECK_CUDA macro to CUDF_CHECK_CUDA. (#10589) @bdice
Upgrade cudf to support pandas 1.4.x versions (#10584) @galipremsagar
Move binop methods from Frame to IndexedFrame and standardize the docstring (#10576) @vyasr
Add default= kwarg to .list.get() accessor method (#10547) @shwina
Remove deprecated decimal_cols_as_float in the ORC reader (#10515) @vuule
Support nvComp 2.3 if local, otherwise use nvcomp 2.2 (#10513) @robertmaynard
Fix findall_record to return empty list for no matches (#10491) @davidwendt
Namespace/Docstring Fixes for Reduction (#10471) @isVoid
Additional refactoring of hash functions (#10462) @bdice
Fix default value of str.split expand parameter. (#10457) @bdice
Remove deprecated code. (#10450) @vyasr

🐛 Bug Fixes

Fix single column MultiIndex issue in sort_index (#10957) @galipremsagar
Make SerializedTableHeader(numRows) public (#10949) @gerashegalov
Fix gcc_linux version pinning in dev environment (#10943) @galipremsagar
Fix an issue with reading raw string in cudf.read_json (#10924) @galipremsagar
Make cudf::test::expect_columns_equal() fail when comparing unsanitary lists. (#10880) @nvdbaranec
Fix segmented_reduce on empty column with non-empty offsets (#10876) @davidwendt
Fix dask-cudf groupby handling when grouping by all columns (#10866) @charlesbluca
Fix a bug in distinct: using nested nulls logic (#10848) @PointKernel
Fix constness / references in weak ordering operator() signatures. (#10846) @bdice
Suppress sizeof-array-div warnings in thrust found by gcc-11 (#10840) @robertmaynard
Add handling for string by-columns in dask-cudf groupby (#10830) @charlesbluca
Fix compile warning in search.cu (#10827) @davidwendt
Fix element access const correctness in hostdevice_vector (#10804) @vuule
Update cuco git tag (#10788) @PointKernel
HostColumnVectorCore#isNull should return true for out-of-range rows (#10779) @gerashegalov
Fixing deprecation warnings in test_orc.py (#10772) @hyperbolic2346
Enable writing to s3 storage in chunked parquet writer (#10769) @galipremsagar
Fix construction of nested structs with EMPTY child (#10761) @shwina
Fix replace error when regex has only zero match quantifiers (#10760) @davidwendt
Fix an issue with one_level_list schemas in parquet reader. (#10750) @nvdbaranec
update mangle_dupe_cols behavior in csv reader to match pandas 1.4.0 behavior (#10749) @karthikeyann
Fix cupy function in notebook (#10737) @ajschmidt8
Fix fillna to retain columns when it is MultiIndex (#10729) @galipremsagar
Fix scatter for all-empty-string column case (#10724) @davidwendt
Retain series name in Series.apply (#10716) @brandon-b-miller
Correct build dir cudf-config dependency issues for static builds (#10704) @robertmaynard
Fix list of testing requirements in setup.py. (#10678) @bdice
Fix rounding to zero error in stod on very small float numbers (#10672) @davidwendt
cuco isn't a cudf dependency when we are built shared (#10662) @robertmaynard
Fix to_timestamps to support Z for %z format specifier (#10617) @davidwendt
Verify compression type in Parquet reader (#10610) @vuule
Fix struct row comparator's exception on empty structs (#10604) @sperlingxx
Fix strings strip() to accept only str Scalar for to_strip parameter (#10597) @davidwendt
Fix has_atomic_support check in can_use_hash_groupby() (#10588) @jbrennan333
Revert Thrust 1.16 to Thrust 1.15 (#10586) @bdice
Fix missing RMM_STATIC_CUDART define when compiling JNI with static CUDA runtime (#10585) @jlowe
pin more cmake versions (#10570) @robertmaynard
Re-enable Build Metrics Report (#10562) @davidwendt
Remove statically linked CUDA runtime check in Java build (#10532) @jlowe
Fix temp data cleanup in test_text.py (#10524) @brandon-b-miller
Update pre-commit to run black 22.3.0 (#10523) @vyasr
Remove deprecated decimal_cols_as_float in the ORC reader (#10515) @vuule
Fix findall_record to return empty list for no matches (#10491) @davidwendt
Allow users to specify data types for a subset of columns in read_csv (#10484) @vuule
Fix default value of str.split expand parameter. (#10457) @bdice
Improve coverage of dask-cudf's groupby aggregation, add tests for dropna support (#10449) @charlesbluca
Allow string aggs for dask_cudf.CudfDataFrameGroupBy.aggregate (#10222) @charlesbluca
In-place updates with loc or iloc don't work correctly when the LHS has more than one column (#9918) @skirui-source

📖 Documentation

Clarify append deprecation notice. (#10930) @bdice
Use full name of GPUDirect Storage SDK in docs (#10904) @vuule
Update Dask + Pandas to Dask + cuDF path (#10897) @miguelusque
Add missing documentation in cudf/types.hpp (#10895) @karthikeyann
Add strong index iterator docs. (#10888) @bdice
spell check fixes (#10865) @karthikeyann
Add missing documentation in scalar/ headers (#10861) @karthikeyann
Remove typo in ngram documentation (#10859) @miguelusque
fix doxygen warnings (#10842) @karthikeyann
Add a library_design.md file documenting the core Python data structures and their relationship (#10817) @vyasr
Add NumPy to intersphinx references. (#10809) @bdice
Add a section to the docs that compares cuDF with Pandas (#10796) @shwina
Mention 2 cpp-reviewer requirement in pull request template (#10768) @davidwendt
Enable pydocstyle for all packages. (#10759) @bdice
Enable pydocstyle rules involving quotes (#10748) @vyasr
Revise 10 minutes notebook. (#10738) @bdice
Reorganize cuDF Python docs (#10691) @shwina
Fix sphinx/jupyter heading issue in UDF notebook (#10690) @brandon-b-miller
Migrated user guide notebooks to MyST-NB and added sphinx extension (#10685) @mmccarty
add data generation to benchmark documentation (#10677) @karthikeyann
Fix some docs build warnings (#10674) @galipremsagar
Update UDF notebook in User Guide. (#10668) @bdice
Improve User Guide docs (#10663) @bdice
Fix some docstrings formatting (#10660) @galipremsagar
Remove implementation details from apply docstrings (#10651) @brandon-b-miller
Revise CONTRIBUTING.md (#10644) @bdice
Add missing APIs to documentation. (#10643) @bdice
Use cudf.read_json as documented API name. (#10640) @bdice
Fix docstring section headings. (#10639) @bdice
Document cudf.read_text and cudf.read_avro. (#10638) @bdice
Fix typo in docstring for json_reader_options (#10627) @dagardner-nv
Update guide to UDFs with notes about Series.applymap deprecation and related changes (#10607) @brandon-b-miller
Fix doxygen Modules page for cudf::lists::sequences (#10561) @davidwendt
Add Replace Backreferences section to Regex Features page (#10560) @davidwendt
Introduce deprecation policy to developer guide. (#10252) @vyasr

🚀 New Features

Enable Zstandard decompression only when all nvcomp integrations are enabled (#10944) @vuule
Handle nested types in cudf::concatenate_rows() (#10890) @nvdbaranec
Strong index types for equality comparator (#10883) @ttnghia
Add parameters to control page size in Parquet writer (#10882) @etseidl
Support for Zstandard decompression in ORC reader (#10873) @vuule
Use pre-built nvcomp 2.3 binaries by default (#10851) @robertmaynard
Support for Zstandard decompression in Parquet reader (#10847) @vuule
Add JNI support for apply_boolean_mask (#10812) @res-life
Segmented Min/Max for Fixed Point Types (#10794) @isVoid
Return per-file metadata from readers (#10782) @vuule
Segmented apply_boolean_mask for LIST columns (#10773) @mythrocks
Update groupby::hash to use new row operators for keys (#10770) @PointKernel
Support purging non-empty null elements from LIST/STRING columns (#10701) @mythrocks
Add detail::hash_join (#10695) @PointKernel
Persist string statistics data across multiple calls to orc chunked write (#10694) @hyperbolic2346
Add .list.astype() to cast list leaves to specified dtype (#10693) @shwina
JNI: Add generateListOffsets API (#10683) @sperlingxx
Support args in groupby apply (#10682) @brandon-b-miller
Enable segmented_gather in Java package (#10669) @sperlingxx
Add row hasher with nested column support (#10641) @devavret
Add support for numeric_only in DataFrame._reduce (#10629) @martinfalisse
First step toward statistics in ORC files with chunked writes (#10567) @hyperbolic2346
Add support for struct columns to the random table generator (#10566) @vuule
Enable passing a sequence for the index argument to .list.get() (#10564) @shwina
Add python bindings for cudf::list::index_of (#10549) @ChrisJar
Add default= kwarg to .list.get() accessor method (#10547) @shwina
Add cudf.DataFrame.applymap (#10542) @brandon-b-miller
Support nvComp 2.3 if local, otherwise use nvcomp 2.2 (#10513) @robertmaynard
Add column field ID control in parquet writer (#10504) @PointKernel
Deprecate Series.applymap (#10497) @brandon-b-miller
Add option to drop cache in cuIO benchmarks (#10488) @vuule
move benchmark input generation in device in reduction nvbench (#10486) @karthikeyann
Support Segmented Min/Max Reduction on String Type (#10447) @isVoid
List element Equality comparator (#10289) @devavret
Implement all methods of groupby rank aggregation in libcudf, python (#9569) @karthikeyann
Implement DataFrame.eval using libcudf ASTs (#8022) @vyasr

🛠️ Improvements

Use conda compilers in env file (#10915) @galipremsagar
Remove C style artifacts in cuIO (#10886) @vuule
Rename sliced_child to get_sliced_child. (#10885) @bdice
Replace defaulted stream value for libcudf APIs that use NVCOMP (#10877) @jbrennan333
Add more unit tests for cudf::distinct for nested types with sliced input (#10860) @ttnghia
Changing list_view.cuh to list_view.hpp (#10854) @ttnghia
More error checking in from_dlpack (#10850) @wence-
Cleanup regex compiler fixed quantifiers source (#10843) @davidwendt
Adds the JNI call for Cuda.deviceSynchronize (#10839) @abellina
Add missing cuda-python dependency to cudf (#10833) @bdice
Change std::string parameters in cudf::strings APIs to std::string_view (#10832) @davidwendt
Split up search.cu to improve compile time (#10831) @davidwendt
Add tests for null scalar binaryops (#10828) @brandon-b-miller
Cleanup regex compile optimize functions (#10825) @davidwendt
Use ThreadedMotoServer instead of subprocess in spinning up s3 server (#10822) @galipremsagar
Import NA from missing rather than using cudf.NA everywhere (#10821) @brandon-b-miller
Refactor regex builtin character-class identifiers (#10814) @davidwendt
Change pattern parameter for regex APIs from std::string to std::string_view (#10810) @davidwendt
Make the JNI API to get list offsets as a view public. (#10807) @revans2
Add cudf JNI docker build github action (#10806) @pxLi
Removed mr parameter from inplace bitmask operations (#10805) @AtlantaPepsi
Refactor cudf::contains, renaming and switching parameters role (#10802) @ttnghia
Handle closed property in IntervalDtype.from_pandas (#10798) @wence-
Return weak orderings from device_row_comparator. (#10793) @rwlee
Rework Scalar imports (#10791) @brandon-b-miller
Enable ccache for cudfjni build in Docker (#10790) @gerashegalov
Generic serialization of all column types (#10784) @wence-
simplifying skiprows test in test_orc.py (#10783) @hyperbolic2346
Use column_views instead of column_device_views in binary operations. (#10780) @bdice
Add struct utility functions. (#10776) @bdice
Add multiple rows to subword tokenizer benchmark (#10767) @davidwendt
Refactor host decompression in ORC reader (#10764) @vuule
Flush output streams before creating a process to drop caches (#10762) @vuule
Refactor binaryop/compiled/util.cpp (#10756) @bdice
Use warp per string for long strings in cudf::strings::contains() (#10739) @davidwendt
Use generator expressions in any/all functions. (#10736) @bdice
Use canonical "magic methods" (replace x.__repr__() with repr(x)). (#10735) @bdice
Improve use of isinstance. (#10734) @bdice
Rename tests from multiIndex to multiindex. (#10732) @bdice
Two-table comparators with strong index types (#10730) @bdice
Replace std::make_pair with std::pair (C++17 CTAD) (#10727) @karthikeyann
Use structured bindings instead of std::tie (#10726) @karthikeyann
Missing f prefix on f-strings fix (#10721) @code-review-doctor
Add max_file_size parameter to chunked parquet dataset writer (#10718) @galipremsagar
Deprecate merge_sorted, change dask cudf usage to internal method (#10713) @isVoid
Prepare dask_cudf test_parquet.py for upcoming API changes (#10709) @rjzamora
Remove or simplify various utility functions (#10705) @vyasr
Allow building arrow with parquet and not python (#10702) @revans2
Partial cuIO GPU decompression refactor (#10699) @vuule
Cython API refactor: merge.pyx (#10698) @isVoid
Fix random string data length to become variable (#10697) @galipremsagar
Add bindings for index_of with column search key (#10696) @ChrisJar
Deprecate index merging (#10689) @vyasr
Remove cudf::strings::string namespace (#10684) @davidwendt
Standardize imports. (#10680) @bdice
Standardize usage of collections.abc. (#10679) @bdice
Cython API Refactor: transpose.pyx, sort.pyx (#10675) @isVoid
Add device_memory_resource parameter to create_string_vector_from_column (#10673) @davidwendt
Split up mixed-join kernels source files (#10671) @davidwendt
Use std::filesystem for temporary directory location and deletion (#10664) @vuule
cleanup benchmark includes (#10661) @karthikeyann
Use upstream clang-format pre-commit hook. (#10659) @bdice
Clean up C++ includes to use <> instead of "". (#10658) @bdice
Handle RuntimeError thrown by CUDA Python in validate_setup (#10653) @shwina
Rework JNI CMake to leverage rapids_find_package (#10649) @jlowe
Use conda to build python packages during GPU tests (#10648) @Ethyling
Deprecate various functions that don't need to be defined for Index. (#10647) @vyasr
Update pinning to allow newer CMake versions. (#10646) @vyasr
Bump hadoop-common from 3.1.4 to 3.2.3 in /java (#10645) @dependabot[bot]
Remove concurrent_unordered_multimap. (#10642) @bdice
Improve parquet dictionary encoding (#10635) @PointKernel
Improve cudf::cuda_error (#10630) @sperlingxx
Add support for null and non-numeric types in Series.diff and DataFrame.diff (#10625) @Matt711
Branch 22.06 merge 22.04 (#10624) @vyasr
Unpin dask & distributed for development (#10623) @galipremsagar
Slightly improve accuracy of stod in to_floats (#10622) @davidwendt
Allow libcudfjni to be built as a static library (#10619) @jlowe
Change stack-based regex state data to use global memory (#10600) @davidwendt
Resolve Forward merging of branch-22.04 into branch-22.06 (#10598) @galipremsagar
KvikIO as an alternative GDS backend (#10593) @madsbk
Rename CUDA_TRY macro to CUDF_CUDA_TRY, rename CHECK_CUDA macro to CUDF_CHECK_CUDA. (#10589) @bdice
Upgrade cudf to support pandas 1.4.x versions (#10584) @galipremsagar
Refactor binary ops for timedelta and datetime columns (#10581) @vyasr
Refactor cudf::strings::count_re API to use count_matches utility (#10580) @davidwendt
Update Programming Language :: Python Versions to 3.8 & 3.9 (#10579) @madsbk
Automate Java cudf jar build with statically linked dependencies (#10578) @gerashegalov
Add patch for thrust-cub 1.16 to fix sort compile times (#10577) @davidwendt
Move binop methods from Frame to IndexedFrame and standardize the docstring (#10576) @vyasr
Cleanup libcudf strings regex classes (#10573) @davidwendt
Simplify preprocessing of arguments for DataFrame binops (#10563) @vyasr
Reduce kernel calls to build strings findall results (#10559) @davidwendt
Forward-merge branch-22.04 to branch-22.06 (#10557) @bdice
Update strings contains benchmark to measure varying match rates (#10555) @davidwendt
JNI: throw CUDA errors more specifically (#10551) @sperlingxx
Enable building static libs (#10545) @trxcllnt
Remove pip requirements files. (#10543) @bdice
Remove Click pinnings that are unnecessary after upgrading black. (#10541) @vyasr
Refactor memory_usage to improve performance (#10537) @galipremsagar
Adjust the valid range of group index for replace_with_backrefs (#10530) @sperlingxx
add accidentally removed comment. (#10526) @vyasr
Update conda environment. (#10525) @vyasr
Remove ColumnBase.__getitem__ (#10516) @vyasr
Optimize left_semi_join by materializing the gather mask (#10511) @cheinger
Define proper binary operation APIs for columns (#10509) @vyasr
Upgrade arrow-cpp & pyarrow to 7.0.0 (#10503) @galipremsagar
Update to Thrust 1.16 (#10489) @bdice
Namespace/Docstring Fixes for Reduction (#10471) @isVoid
Update cudfjni 22.06.0-SNAPSHOT (#10467) @pxLi
Use Lists of Columns for Various Files (#10463) @isVoid
Additional refactoring of hash functions (#10462) @bdice
Fix Series.str.findall behavior for expand=False. (#10459) @bdice
Remove deprecated code. (#10450) @vyasr
Update cmake-format version. (#10440) @vyasr
Consolidate C++ conda recipes and add libcudf-tests package (#10326) @ajschmidt8
Use conda compilers (#10275) @Ethyling
Add row bitmask as a detail::hash_join member (#10248) @PointKernel

- C++
Published by GPUtester almost 4 years ago

https://github.com/rapidsai/cudf - v22.04.00

🚨 Breaking Changes

Drop unsupported method argument from nunique and distinct_count. (#10411) @bdice
Refactor stream compaction APIs (#10370) @PointKernel
Add scan_aggregation and reduce_aggregation derived types. (#10357) @nvdbaranec
Avoid decimal type narrowing for decimal binops (#10299) @galipremsagar
Rewrites sample API (#10262) @isVoid
Remove probe-time null equality parameters in cudf::hash_join (#10260) @PointKernel
Enable proper Index round-tripping in orc reader and writer (#10170) @galipremsagar
Add JNI for strings::split_re and strings::split_record_re (#10139) @ttnghia
Change cudf::strings::find_multiple to return a lists column (#10134) @davidwendt
Remove the option to completely disable decimal128 columns in the ORC reader (#10127) @vuule
Remove deprecated code (#10124) @vyasr
Update gpu_utils.py to reflect current CUDA support. (#10113) @bdice
Optimize compaction operations (#10030) @PointKernel
Remove deprecated method Series.set_index. (#9945) @bdice
Add cudf::strings::findall_record API (#9911) @davidwendt
Upgrade arrow & pyarrow to 6.0.1 (#9686) @galipremsagar

🐛 Bug Fixes

Fix an issue with tdigest merge aggregations. (#10506) @nvdbaranec
Batch of fixes for index overflows in grid stride loops. (#10448) @nvdbaranec
Update dask_cudf imports to be compatible with latest dask (#10442) @rlratzel
Fix for integer overflow in contiguous-split (#10437) @jbrennan333
Fix has_null predicate for drop_list_duplicates on nested structs (#10436) @sperlingxx
Fix empty reduce with List output and non-List input (#10435) @sperlingxx
Fix list and struct meta generation issue in dask-cudf (#10434) @galipremsagar
Fix error in cudf.to_numeric when a bool input is passed (#10431) @galipremsagar
Support cupy array in quantile input (#10429) @galipremsagar
Fix benchmarks to work with new aggregation types (#10428) @davidwendt
Fix cudf::shift to handle offset greater than column size (#10414) @davidwendt
Fix lifespan of the temporary directory that holds cuFile configuration file (#10403) @vuule
Fix error thrown in compiled-binaryop benchmark (#10398) @davidwendt
Limiting async allocator using alignment of 512 (#10395) @rongou
Include <optional> in multibyte split. (#10385) @bdice
Fix issue with column and scalar re-assignment (#10377) @galipremsagar
Fix floating point data generation in benchmarks (#10372) @vuule
Avoid overflow in fused_concatenate_kernel output_index (#10344) @abellina
Remove is_relationally_comparable for table device views (#10342) @davidwendt
Fix debug compile error in device_span to column_view conversion (#10331) @davidwendt
Add Pascal support to JCUDF transcode (row_conversion) (#10329) @mythrocks
Fix std::bad_alloc exception due to JIT reserving a huge buffer (#10317) @ttnghia
Fixes up the overflowed fixed-point round on nullable column (#10316) @sperlingxx
Fix DataFrame slicing issues for empty cases (#10310) @brandon-b-miller
Fix documentation issues (#10307) @ajschmidt8
Allow Java bindings to use default decimal precisions when writing columns (#10276) @sperlingxx
Fix incorrect slicing of GDS read/write calls (#10274) @vuule
Fix out-of-memory error in compiled-binaryop benchmark (#10269) @davidwendt
Add tests of reflected ufuncs and fix behavior of logical reflected ufuncs (#10261) @vyasr
Remove probe-time null equality parameters in cudf::hash_join (#10260) @PointKernel
Fix out-of-memory error in UrlDecode benchmark (#10258) @davidwendt
Fix groupby reductions that perform operations on source type instead of target type (#10250) @ttnghia
Fix small leak in explode (#10245) @revans2
Yet another small JNI memory leak (#10238) @revans2
Fix regex octal parsing to limit to 3 characters (#10233) @davidwendt
Fix string to decimal128 conversion handling large exponents (#10231) @davidwendt
Fix JNI leak on copy to device (#10229) @revans2
Fix the data generator element size for decimal types (#10225) @vuule
Fix decimal metadata in parquet writer (#10224) @galipremsagar
Fix strings handling of hex in regex pattern (#10220) @davidwendt
Fix docs builds (#10216) @ajschmidt8
Fix a leftover _has_nulls change from Nullate (#10211) @devavret
Fix bitmask of the output for JNI of lists::drop_list_duplicates (#10210) @ttnghia
Fix compile error in binaryop/compiled/util.cpp (#10209) @ttnghia
Skip ORC and Parquet readers' benchmark cases that are not currently supported (#10194) @vuule
Fix JNI leak of a cudf::column_view native class. (#10171) @revans2
Enable proper Index round-tripping in orc reader and writer (#10170) @galipremsagar
Convert Column Name to String Before Using Struct Column Factory (#10156) @isVoid
Preserve the correct ListDtype while creating an identical empty column (#10151) @galipremsagar
benchmark fixture - static object pointer fix (#10145) @karthikeyann
Fix UDF Caching (#10133) @brandon-b-miller
Raise duplicate column error in DataFrame.rename (#10120) @galipremsagar
Fix flaky memory usage test by guaranteeing array size. (#10114) @vyasr
Encode values from python callback for C++ (#10103) @jdye64
Add check for regex instructions causing an infinite-loop (#10095) @davidwendt
Remove metadata singleton from nvtext normalizer (#10090) @davidwendt
Column equality testing fixes (#10011) @brandon-b-miller
Pin libcudf runtime dependency for cudf / libcudf-kafka nightlies (#9847) @charlesbluca

📖 Documentation

Fix documentation for DataFrame.corr and Series.corr. (#10493) @bdice
Add cut to API docs (#10479) @shwina
Remove documentation for methods removed in #10124. (#10366) @bdice
Fix documentation issues (#10306) @ajschmidt8
Fix fixed_point binary operation documentation (#10198) @codereport
Remove cleaned up methods from docs (#10189) @galipremsagar
Update developer guide to recommend no default stream parameter. (#10136) @bdice
Update benchmarking guide to use NVBench. (#10093) @bdice

🚀 New Features

Add StringIO support to read_text (#10465) @cwharris
Add support for tdigest and merge_tdigest aggregations through cudf::reduce (#10433) @nvdbaranec
JNI support for Collect Ops in Reduction (#10427) @sperlingxx
Enable read_text with dask_cudf using byte_range (#10407) @ChrisJar
Add cudf::stable_sort_by_key (#10387) @PointKernel
Implement maps_column_view abstraction over LIST<STRUCT<K,V>> (#10380) @mythrocks
Support Java bindings for Avro reader (#10373) @HaoYang670
Refactor stream compaction APIs (#10370) @PointKernel
Support collect aggregations in reduction (#10353) @sperlingxx
Refactor __array_ufunc__ for Index and unify across all classes (#10346) @vyasr
Add JNI for extract_list_element with index column (#10341) @firestarman
Support min and max operations for structs in rolling window (#10332) @ttnghia
Add device create_sequence_table for benchmarks (#10300) @karthikeyann
Enable numpy ufuncs for DataFrame (#10287) @vyasr
move input generation for json benchmark to device (#10281) @karthikeyann
move input generation for type dispatcher benchmark to device (#10280) @karthikeyann
move input generation for copy benchmark to device (#10279) @karthikeyann
generate url decode benchmark input in device (#10278) @karthikeyann
device input generation in join bench (#10277) @karthikeyann
Add nvtext::byte_pair_encoding API (#10270) @davidwendt
Prevent internal usage of expensive APIs (#10263) @vyasr
Column to JCUDF row for tables with strings (#10235) @hyperbolic2346
Support percent_rank() aggregation (#10227) @mythrocks
Refactor Series.__array_ufunc__ (#10217) @vyasr
Reduce pytest runtime (#10203) @brandon-b-miller
Add regex flags parameter to python cudf strings split (#10185) @davidwendt
Support for MOD, PMOD and PYMOD for decimal32/64/128 (#10179) @codereport
Adding string row size iterator for row to column and column to row conversion (#10157) @hyperbolic2346
Add file size counter to cuIO benchmarks (#10154) @vuule
byte_range support for multibyte_split/read_text (#10150) @cwharris
Add JNI for strings::split_re and strings::split_record_re (#10139) @ttnghia
Add maxSplit parameter to Java binding for strings:split (#10137) @ttnghia
Add libcudf strings split API that accepts regex pattern (#10128) @davidwendt
generate benchmark input in device (#10109) @karthikeyann
Avoid nan_as_null op if nan_count is 0 (#10082) @galipremsagar
Add Dataframe and Index nunique (#10077) @martinfalisse
Support nanosecond timestamps in parquet (#10063) @PointKernel
Java bindings for mixed semi and anti joins (#10040) @jlowe
Implement mixed equality/conditional semi/anti joins (#10037) @vyasr
Optimize compaction operations (#10030) @PointKernel
Support args= in Series.apply (#9982) @brandon-b-miller
Add cudf::strings::findall_record API (#9911) @davidwendt
Add covariance for sort groupby (python) (#9889) @mayankanand007
Implement DataFrame diff() (#9817) @skirui-source
Implement DataFrame pct_change (#9805) @skirui-source
Support segmented reductions and null mask reductions (#9621) @isVoid
Add 'spearman' correlation method for dataframe.corr and series.corr (#7141) @dominicshanshan

🛠️ Improvements

Add scipy skip for a test (#10502) @galipremsagar
Temporarily disable new ops-bot functionality (#10496) @ajschmidt8
Include <cstddef> to fix compilation of parquet reader on GCC 11. (#10483) @bdice
Pin dask and distributed (#10481) @galipremsagar
MD5 refactoring. (#10445) @bdice
Remove or split up Frame methods that use the index (#10439) @vyasr
Centralization of tdigest aggregation code. (#10422) @nvdbaranec
Simplify column binary operations (#10421) @vyasr
Add .github/ops-bot.yaml config file (#10420) @ajschmidt8
Use list of columns for methods in Groupby.pyx (#10419) @isVoid
Remove warnings in test_timedelta.py (#10418) @galipremsagar
Fix some warnings in test_parquet.py (#10416) @galipremsagar
JNI support for segmented reduce (#10413) @revans2
Clean up null mask after purging null entries (#10412) @sperlingxx
Drop unsupported method argument from nunique and distinct_count. (#10411) @bdice
Use str instead of builtins.str. (#10410) @bdice
Fix warnings in test_rolling (#10405) @bdice
Enable codecov github-check in CI (#10404) @galipremsagar
Fix warnings in test_cuda_apply, test_numerical, test_pickling, test_unaops. (#10402) @bdice
Set column names in _from_columns_like_self factory (#10400) @isVoid
Refactor nvtx annotations in cudf & dask-cudf (#10396) @galipremsagar
Consolidate .cov and .corr for sort groupby (#10386) @skirui-source
Consolidate some Frame APIs (#10381) @vyasr
Refactor hash functions and hash_combine (#10379) @bdice
Add nvtx annotations for Series and Index (#10374) @galipremsagar
Refactor filling.repeat API (#10371) @isVoid
Move standalone UTF8 functions from string_view.hpp to utf8.hpp (#10369) @davidwendt
Remove doc for deprecated function one_hot_encoding (#10367) @isVoid
Refactor array function (#10364) @vyasr
Fix warnings in test_csv.py. (#10362) @bdice
Implement a mixin for binops (#10360) @vyasr
Refactor cython interface: copying.pyx (#10359) @isVoid
Implement a mixin for scans (#10358) @vyasr
Add scan_aggregation and reduce_aggregation derived types. (#10357) @nvdbaranec
Add cleanup of python artifacts (#10355) @galipremsagar
Fix warnings in test_categorical.py. (#10354) @bdice
Create a dispatcher for invoking regex kernel functions (#10349) @davidwendt
Fix codecov in CI (#10347) @galipremsagar
Enable caching for memory_usage calculation in Column (#10345) @galipremsagar
C++17 cleanup: traits replace std::enable_if<>::type with std::enable_if_t (#10343) @karthikeyann
JNI: Support appending DECIMAL128 into ColumnBuilder in terms of byte array (#10338) @sperlingxx
multibyte_split test improvements (#10328) @vuule
Fix warnings in test_binops.py. (#10327) @bdice
Fix warnings from pandas in test_array_ufunc.py. (#10324) @bdice
Update upload script (#10321) @ajschmidt8
Move hash type declarations to hashing.hpp (#10320) @davidwendt
C++17 cleanup: traits replace ::value with _v (#10319) @karthikeyann
Remove internal columns usage (#10315) @vyasr
Remove extraneous build.sh parameter (#10313) @ajschmidt8
Add const qualifier to MurmurHash3_32::hash_combine (#10311) @davidwendt
Remove TODO in libcudf_kafka recipe (#10309) @ajschmidt8
Add conversions between column_view and device_span<T const>. (#10302) @bdice
Avoid decimal type narrowing for decimal binops (#10299) @galipremsagar
Deprecate DataFrame.iteritems and introduce .items (#10298) @galipremsagar
Explicitly request CMake use gnu++17 over c++17 (#10297) @robertmaynard
Add copyright check as pre-commit hook. (#10290) @vyasr
DataFrame insert and creation optimizations (#10285) @galipremsagar
Improve hash join detail functions (#10273) @PointKernel
Replace custom cached_property implementation with functools (#10272) @shwina
Rewrites sample API (#10262) @isVoid
Bump hadoop-common from 3.1.0 to 3.1.4 in /java (#10259) @dependabot[bot]
Remove making redundant copy across code-base (#10257) @galipremsagar
Add more nvtx annotations (#10256) @galipremsagar
Add copyright check in cudf (#10253) @galipremsagar
Remove redundant copies in fillna to improve performance (#10241) @galipremsagar
Remove std::numeric_limit specializations for timestamp & durations (#10239) @codereport
Optimize DataFrame creation across code-base (#10236) @galipremsagar
Change pytest distribution algorithm and increase parallelism in CI (#10232) @galipremsagar
Add environment variables for I/O thread pool and slice sizes (#10218) @vuule
Add regex flags to strings findall functions (#10208) @davidwendt
Update dask-cudf parquet tests to reflect upstream bugfixes to _metadata (#10206) @charlesbluca
Remove unnecessary nunique function in Series. (#10205) @martinfalisse
Refactor DataFrame tests. (#10204) @bdice
Rewrites column.__setitem__, Use boolean_mask_scatter (#10202) @isVoid
Java utilities to aid in accelerating aggregations on 128-bit types (#10201) @jlowe
Fix docstrings alignment in Frame methods (#10199) @galipremsagar
Fix cuco pair issue in hash join (#10195) @PointKernel
Replace dask groupby .index usages with .by (#10193) @galipremsagar
Add regex flags to strings extract function (#10192) @davidwendt
Forward-merge branch-22.02 to branch-22.04 (#10191) @bdice
Add CMake install rule for tests (#10190) @ajschmidt8
Unpin dask & distributed (#10182) @galipremsagar
Add comments to explain test validation (#10176) @galipremsagar
Reduce warnings in pytest output (#10168) @bdice
Some consolidation of indexed frame methods (#10167) @vyasr
Refactor isin implementations (#10165) @vyasr
Faster struct row comparator (#10164) @devavret
Refactor groupby::get_groups. (#10161) @bdice
Deprecate decimal_cols_as_float in ORC reader (C++ layer) (#10152) @vuule
Replace ccache with sccache (#10146) @ajschmidt8
Murmur3 hash kernel cleanup (#10143) @rwlee
Deprecate decimal_cols_as_float in ORC reader (#10142) @galipremsagar
Run pyupgrade 2.31.0. (#10141) @bdice
Remove drop_nan from internal IndexedFrame._drop_na_rows. (#10140) @bdice
Change cudf::strings::find_multiple to return a lists column (#10134) @davidwendt
Update cmake-format script for branch 22.04. (#10132) @bdice
Accept r-value references in convert_table_for_return(): (#10131) @mythrocks
Remove the option to completely disable decimal128 columns in the ORC reader (#10127) @vuule
Remove deprecated code (#10124) @vyasr
Update gpu_utils.py to reflect current CUDA support. (#10113) @bdice
Remove benchmarks suffix (#10112) @bdice
Update cudf java binding version to 22.04.0-SNAPSHOT (#10084) @pxLi
Remove unnecessary docker files. (#10069) @vyasr
Limit benchmark iterations using environment variable (#10060) @karthikeyann
Add timing chart for libcudf build metrics report page (#10038) @davidwendt
JNI: Rewrite growBuffersAndRows to accelerate the HostColumnBuilder (#10025) @sperlingxx
Reduce redundant code in CUDF JNI (#10019) @mythrocks
Make snappy decompress check more efficient (#9995) @cheinger
Remove deprecated method Series.set_index. (#9945) @bdice
Implement a mixin for reductions (#9925) @vyasr
JNI: Push back decimal utils from spark-rapids (#9907) @sperlingxx
Add assert_column_memory_* (#9882) @isVoid
Add CUDF_UNREACHABLE macro. (#9727) @bdice
Upgrade arrow & pyarrow to 6.0.1 (#9686) @galipremsagar

- C++
Published by GPUtester about 4 years ago

https://github.com/rapidsai/cudf - v22.02.00

🚨 Breaking Changes

ORC writer API changes for granular statistics (#10058) @mythrocks
decimal128 Support for to/from_arrow (#9986) @codereport
Remove deprecated method one_hot_encoding (#9977) @isVoid
Remove str.subword_tokenize (#9968) @VibhuJawa
Remove deprecated method parameter from merge and join. (#9944) @bdice
Remove deprecated method DataFrame.hash_columns. (#9943) @bdice
Remove deprecated method Series.hash_encode. (#9942) @bdice
Refactoring ceil/round/floor code for datetime64 types (#9926) @mayankanand007
Introduce nan_as_null parameter for cudf.Index (#9893) @galipremsagar
Add regex_flags parameter to strings replace_re functions (#9878) @davidwendt
Break tie for top categorical columns in Series.describe (#9867) @isVoid
Add partitioning support in parquet writer (#9810) @devavret
Move drop_duplicates, drop_na, _gather, take to IndexFrame and create their _base_index counterparts (#9807) @isVoid
Raise temporary error for decimal128 types in parquet reader (#9804) @galipremsagar
Change default dtype of all nulls column from float to object (#9803) @galipremsagar
Remove unused masked udf cython/c++ code (#9792) @brandon-b-miller
Pick smallest decimal type with required precision in ORC reader (#9775) @vuule
Add decimal128 support to Parquet reader and writer (#9765) @vuule
Refactor TableTest assertion methods to a separate utility class (#9762) @jlowe
Use cuFile direct device reads/writes by default in cuIO (#9722) @vuule
Match pandas scalar result types in reductions (#9717) @brandon-b-miller
Add parameters to control row group size in Parquet writer (#9677) @vuule
Refactor bit counting APIs, introduce valid/null count functions, and split host/device side code for segmented counts. (#9588) @bdice
Add support for decimal128 in cudf python (#9533) @galipremsagar
Implement lists::index_of() to find positions in list rows (#9510) @mythrocks
Rewriting row/column conversions for Spark <-> cudf data conversions (#8444) @hyperbolic2346

🐛 Bug Fixes

Add check for negative stripe index in ORC reader (#10074) @vuule
Update Java tests to expect DECIMAL128 from Arrow (#10073) @jlowe
Avoid index materialization when DataFrame is created with un-named Series objects (#10071) @galipremsagar
fix gcc 11 compilation errors (#10067) @rongou
Fix columns ordering issue in parquet reader (#10066) @galipremsagar
Fix dataframe setitem with ndarray types (#10056) @galipremsagar
Remove implicit copy due to conversion from cudf::size_type and size_t (#10045) @robertmaynard
Include <optional> in headers that use std::optional (#10044) @robertmaynard
Fix repr and concat of StructColumn (#10042) @galipremsagar
Include row group level stats when writing ORC files (#10041) @vuule
build.sh respects the --build_metrics and --incl_cache_stats flags (#10035) @robertmaynard
Fix memory leaks in JNI native code. (#10029) @mythrocks
Update JNI to use new arena mr constructor (#10027) @rongou
Fix null check when comparing structs in arg_min operation of reduction/groupby (#10026) @ttnghia
Wrap CI script shell variables in quotes to fix local testing. (#10018) @bdice
cudftestutil no longer propagates compiler flags to external users (#10017) @robertmaynard
Remove CUDA_DEVICE_CALLABLE macro usage (#10015) @hyperbolic2346
Add missing list filling header in meta.yaml (#10007) @devavret
Fix conda recipes for custreamz & cudf_kafka (#10003) @ajschmidt8
Fix matching regex word-boundary (\b) in strings replace (#9997) @davidwendt
Fix null check when comparing structs in min and max reduction/groupby operations (#9994) @ttnghia
Fix octal pattern matching in regex string (#9993) @davidwendt
decimal128 Support for to/from_arrow (#9986) @codereport
Fix groupby shift/diff/fill after selecting from a GroupBy (#9984) @shwina
Fix the overflow problem of decimal rescale (#9966) @sperlingxx
Use default value for decimal precision in parquet writer when not specified (#9963) @devavret
Fix cudf java build error. (#9958) @firestarman
Use gpuci_mamba_retry to install local artifacts. (#9951) @bdice
Fix regression HostColumnVectorCore requiring native libs (#9948) @jlowe
Rename aggregate_metadata in writer to fix name collision (#9938) @devavret
Fixed issue with percentile_approx where output tdigests could have uninitialized data at the end. (#9931) @nvdbaranec
Resolve racecheck errors in ORC kernels (#9916) @vuule
Fix the java build after parquet partitioning support (#9908) @revans2
Fix compilation of benchmark for parquet writer. (#9905) @bdice
Fix a memcheck error in ORC writer (#9896) @vuule
Introduce nan_as_null parameter for cudf.Index (#9893) @galipremsagar
Fix fallback to sort aggregation for grouping only hash aggregate (#9891) @abellina
Add zlib to cudfjni link when using static libcudf library dependency (#9890) @jlowe
TimedeltaIndex constructor raises an AttributeError. (#9884) @skirui-source
Fix cudf.Scalar string datetime construction (#9875) @brandon-b-miller
Load libcufile.so with RTLD_NODELETE flag (#9872) @vuule
Break tie for top categorical columns in Series.describe (#9867) @isVoid
Fix null handling for structs min and arg_min in groupby, groupby scan, reduction, and inclusive_scan (#9864) @ttnghia
Add one-level list encoding support in parquet reader (#9848) @PointKernel
Fix an out-of-bounds read in validity copying in contiguous_split. (#9842) @nvdbaranec
Fix join of MultiIndex to Index with one column and overlapping name. (#9830) @vyasr
Fix caching in Series.applymap (#9821) @brandon-b-miller
Enforce boolean ascending for dask-cudf sort_values (#9814) @charlesbluca
Fix ORC writer crash with empty input columns (#9808) @vuule
Change default dtype of all nulls column from float to object (#9803) @galipremsagar
Load native dependencies when Java ColumnView is loaded (#9800) @jlowe
Fix dtype-argument bug in dask_cudf read_csv (#9796) @rjzamora
Fix overflow for min calculation in strings::from_timestamps (#9793) @revans2
Fix memory error due to lambda return type deduction limitation (#9778) @karthikeyann
Revert regex $/EOL end-of-string new-line special case handling (#9774) @davidwendt
Fix missing streams (#9767) @karthikeyann
Fix make_empty_scalar_like on list_type (#9759) @sperlingxx
Update cmake and conda to 22.02 (#9746) @devavret
Fix out-of-bounds memory write in decimal128-to-string conversion (#9740) @davidwendt
Match pandas scalar result types in reductions (#9717) @brandon-b-miller
Fix regex non-multiline EOL/$ matching strings ending with a new-line (#9715) @davidwendt
Fixed build by adding more checks for int8, int16 (#9707) @razajafri
Fix null handling when boolean dtype is passed (#9691) @galipremsagar
Fix stream usage in segmented_gather() (#9679) @mythrocks

📖 Documentation

Update decimal dtypes related docs entries (#10072) @galipremsagar
Fix regex doc describing hexadecimal escape characters (#10009) @davidwendt
Fix cudf compilation instructions. (#9956) @esoha-nvidia
Fix see also links for IO APIs (#9895) @galipremsagar
Fix build instructions for libcudf doxygen (#9837) @davidwendt
Fix some doxygen warnings and add missing documentation (#9770) @karthikeyann
update cuda version in local build (#9736) @karthikeyann
Fix doxygen for enum types in libcudf (#9724) @davidwendt
Spell check fixes (#9682) @karthikeyann
Fix links in C++ Developer Guide. (#9675) @bdice

🚀 New Features

Remove libcudacxx patch needed for nvcc 11.4 (#10057) @robertmaynard
Allow CuPy 10 (#10048) @jakirkham
Add in support for NULL_LOGICAL_AND and NULL_LOGICAL_OR binops (#10016) @revans2
Add groupby.transform (only support for aggregations) (#10005) @shwina
Add partitioning support to Parquet chunked writer (#10000) @devavret
Add jni for sequences (#9972) @wbo4958
Java bindings for mixed left, inner, and full joins (#9941) @jlowe
Java bindings for JSON reader support (#9940) @wbo4958
Enable transpose for string columns in cudf python (#9937) @galipremsagar
Support structs for cudf::contains with column/scalar input (#9929) @ttnghia
Implement mixed equality/conditional joins (#9917) @vyasr
Add cudf::strings::extract_all API (#9909) @davidwendt
Implement JNI for cudf::scatter APIs (#9903) @ttnghia
JNI: Function to copy and set validity from bool column. (#9901) @mythrocks
Add dictionary support to cudf::copy_if_else (#9887) @davidwendt
add run_benchmarks target for running benchmarks with json output (#9879) @karthikeyann
Add regex_flags parameter to strings replace_re functions (#9878) @davidwendt
add_suffix and add_prefix for DataFrames and Series (#9846) @mayankanand007
Add JNI for cudf::drop_duplicates (#9841) @ttnghia
Implement per-list sequence (#9839) @ttnghia
adding series.transpose (#9835) @mayankanand007
Adding support for Series.autocorr (#9833) @mayankanand007
Support round operation on datetime64 datatypes (#9820) @mayankanand007
Add partitioning support in parquet writer (#9810) @devavret
Raise temporary error for decimal128 types in parquet reader (#9804) @galipremsagar
Add decimal128 support to Parquet reader and writer (#9765) @vuule
Optimize groupby::scan (#9754) @PointKernel
Add sample JNI API (#9728) @res-life
Support min and max in inclusive scan for structs (#9725) @ttnghia
Add first and last method to IndexedFrame (#9710) @isVoid
Support min and max reduction for structs (#9697) @ttnghia
Add parameters to control row group size in Parquet writer (#9677) @vuule
Run compute-sanitizer in nightly build (#9641) @karthikeyann
Implement Series.datetime.floor (#9571) @skirui-source
ceil/floor for DatetimeIndex (#9554) @mayankanand007
Add support for decimal128 in cudf python (#9533) @galipremsagar
Implement lists::index_of() to find positions in list rows (#9510) @mythrocks
custreamz oauth callback for kafka (librdkafka) (#9486) @jdye64
Add Pearson correlation for sort groupby (python) (#9166) @skirui-source
Interchange dataframe protocol (#9071) @iskode
Rewriting row/column conversions for Spark <-> cudf data conversions (#8444) @hyperbolic2346

🛠️ Improvements

Prepare upload scripts for Python 3.7 removal (#10092) @Ethyling
Simplify custreamz and cudf_kafka recipes files (#10065) @Ethyling
ORC writer API changes for granular statistics (#10058) @mythrocks
Remove python constraints in custreamz and cudf_kafka recipes (#10052) @Ethyling
Unpin dask and distributed in CI (#10028) @galipremsagar
Add _from_column_like_self factory (#10022) @isVoid
Replace custom CUDA bindings previously provided by RMM with official CUDA Python bindings (#10008) @shwina
Use cuda::std::is_arithmetic in cudf::is_numeric trait. (#9996) @bdice
Clean up CUDA stream use in cuIO (#9991) @vuule
Use addressed-ordered first fit for the pinned memory pool (#9989) @rongou
Add strings tests to transpose_test.cpp (#9985) @davidwendt
Use gpuci_mamba_retry on Java CI. (#9983) @bdice
Remove deprecated method one_hot_encoding (#9977) @isVoid
Minor cleanup of unused Python functions (#9974) @vyasr
Use new efficient partitioned parquet writing in cuDF (#9971) @devavret
Remove str.subword_tokenize (#9968) @VibhuJawa
Forward-merge branch-21.12 to branch-22.02 (#9947) @bdice
Remove deprecated method parameter from merge and join. (#9944) @bdice
Remove deprecated method DataFrame.hash_columns. (#9943) @bdice
Remove deprecated method Series.hash_encode. (#9942) @bdice
use ninja in java ci build (#9933) @rongou
Add build-time publish step to cpu build script (#9927) @davidwendt
Refactoring ceil/round/floor code for datetime64 types (#9926) @mayankanand007
Remove various unused functions (#9922) @vyasr
Raise in query if dtype is not supported (#9921) @brandon-b-miller
Add missing imports tests (#9920) @Ethyling
Spark Decimal128 hashing (#9919) @rwlee
Replace thrust/std::get with structured bindings (#9915) @codereport
Upgrade thrust version to 1.15 (#9912) @robertmaynard
Remove conda envs for CUDA 11.0 and 11.2. (#9910) @bdice
Return count of set bits from inplace_bitmask_and. (#9904) @bdice
Use dynamic nullate for join hasher and equality comparator (#9902) @davidwendt
Update ucx-py version on release using rvc (#9897) @Ethyling
Remove IncludeCategories from .clang-format (#9876) @codereport
Support statically linking CUDA runtime for Java bindings (#9873) @jlowe
Add clang-tidy to libcudf (#9860) @codereport
Remove deprecated methods from Java Table class (#9853) @jlowe
Add test for map column metadata handling in ORC writer (#9852) @vuule
Use pandas to_offset to parse frequency string in date_range (#9843) @isVoid
add templated benchmark with fixture (#9838) @karthikeyann
Use list of column inputs for apply_boolean_mask (#9832) @isVoid
Added a few more tests for Decimal to String cast (#9818) @razajafri
Run doctests. (#9815) @bdice
Avoid overflow for fixed_point round (#9809) @sperlingxx
Move drop_duplicates, drop_na, _gather, take to IndexFrame and create their _base_index counterparts (#9807) @isVoid
Use vector factories for host-device copies. (#9806) @bdice
Refactor host device macros (#9797) @vyasr
Remove unused masked udf cython/c++ code (#9792) @brandon-b-miller
Allow custom sort functions for dask-cudf sort_values (#9789) @charlesbluca
Improve build time of libcudf iterator tests (#9788) @davidwendt
Copy Java native dependencies directly into classpath (#9787) @jlowe
Add decimal types to cuIO benchmarks (#9776) @vuule
Pick smallest decimal type with required precision in ORC reader (#9775) @vuule
Avoid overflow for fixed_point cudf::cast and performance optimization (#9772) @codereport
Use CTAD with Thrust function objects (#9768) @codereport
Refactor TableTest assertion methods to a separate utility class (#9762) @jlowe
Use Java classloader to find test resources (#9760) @jlowe
Allow cast decimal128 to string and add tests (#9756) @razajafri
Load balance optimization for contiguous_split (#9755) @nvdbaranec
Consolidate and improve reset_index (#9750) @isVoid
Update to UCX-Py 0.24 (#9748) @pentschev
Skip cufile tests in JNI build script (#9744) @pxLi
Enable string to decimal 128 cast (#9742) @razajafri
Use stop instead of stop_. (#9735) @bdice
Forward-merge branch-21.12 to branch-22.02 (#9730) @bdice
Improve cmake format script (#9723) @vyasr
Use cuFile direct device reads/writes by default in cuIO (#9722) @vuule
Add directory-partitioned data support to cudf.read_parquet (#9720) @rjzamora
Use stream allocator adaptor for hash join table (#9704) @PointKernel
Update check for inf/nan strings in libcudf float conversion to ignore case (#9694) @davidwendt
Update cudf JNI to 22.02.0-SNAPSHOT (#9681) @pxLi
Replace cudf's concurrent_ordered_map with cuco::static_map in semi/anti joins (#9666) @vyasr
Some improvements to parse_decimal function and bindings for is_fixed_point (#9658) @razajafri
Add utility to format ninja-log build times (#9631) @davidwendt
Allow runtime has_nulls parameter for row operators (#9623) @davidwendt
Use fsspec.parquet for improved read_parquet performance from remote storage (#9589) @rjzamora
Refactor bit counting APIs, introduce valid/null count functions, and split host/device side code for segmented counts. (#9588) @bdice
Use List of Columns as Input for drop_nulls, gather and drop_duplicates (#9558) @isVoid
Simplify merge internals and reduce overhead (#9516) @vyasr
Add struct generation support in datagenerator & fuzz tests (#9180) @galipremsagar
Simplify write_csv by removing unnecessary writer/impl classes (#9089) @cwharris

- C++
Published by GPUtester over 4 years ago

https://github.com/rapidsai/cudf - v21.12.02

v21.12.02

- C++
Published by GPUtester over 4 years ago

https://github.com/rapidsai/cudf - v21.12.01

v21.12.01

- C++
Published by GPUtester over 4 years ago

https://github.com/rapidsai/cudf - v21.12.00

🚨 Breaking Changes

Update bitmask_and and bitmask_or to return a pair of resulting mask and count of unset bits (#9616) @PointKernel
Remove sizeof and standardize on memory_usage (#9544) @vyasr
Add support for single-line regex anchors ^/$ in contains_re (#9482) @davidwendt
Refactor sorting APIs (#9464) @vyasr
Update Java nvcomp JNI bindings to nvcomp 2.x API (#9384) @jbrennan333
Support Python UDFs written in terms of rows (#9343) @brandon-b-miller
JNI: Support nested types in ORC writer (#9334) @firestarman
Optionally nullify out-of-bounds indices in segmented_gather(). (#9318) @mythrocks
Refactor cuIO timestamp processing with cuda::std::chrono (#9278) @PointKernel
Various internal MultiIndex improvements (#9243) @vyasr

🐛 Bug Fixes

Fix read_parquet bug for bytes input (#9669) @rjzamora
Use _gather internal for sort_* (#9668) @isVoid
Fix behavior of equals for non-DataFrame Frames and add tests. (#9653) @vyasr
Don't recompute output size if it is already available (#9649) @abellina
Fix read_parquet bug for extended dtypes from remote storage (#9638) @rjzamora
add const when getting data from a JNI data wrapper (#9637) @wjxiz1992
Fix debrotli issue on CUDA 11.5 (#9632) @vuule
Use std::size_t when computing join output size (#9626) @jlowe
Fix usecols parameter handling in dask_cudf.read_csv (#9618) @galipremsagar
Add support for string 'nan', 'inf' & '-inf' values while type-casting to float (#9613) @galipremsagar
Avoid passing NativeFileDatasource to pyarrow in read_parquet (#9608) @rjzamora
Fix test failure with cuda 11.5 in row_bit_count tests. (#9581) @nvdbaranec
Correct _LIBCUDACXX_CUDACC_VER value computation (#9579) @robertmaynard
Increase max RLE stream size estimate to avoid potential overflows (#9568) @vuule
Fix edge case in tdigest scalar generation for groups containing all nulls. (#9551) @nvdbaranec
Fix pytests failing in cuda-11.5 environment (#9547) @galipremsagar
compile libnvcomp with PTDS if requested (#9540) @jbrennan333
Fix segmented_gather() for null LIST rows (#9537) @mythrocks
Deprecate DataFrame.label_encoding, use private _label_encoding method internally. (#9535) @bdice
Fix several test and benchmark issues related to bitmask allocations. (#9521) @nvdbaranec
Fix for inserting duplicates in groupby result cache (#9508) @karthikeyann
Fix mismatched types error in clip() when using non int64 numeric types (#9498) @davidwendt
Match conda pinnings for style checks (revert part of #9412, #9433). (#9490) @bdice
Make sure all dask-cudf supported aggs are handled in _tree_node_agg (#9487) @charlesbluca
Resolve hash_columns FutureWarning in dask_cudf (#9481) @pentschev
Add fixed point to AllTypes in libcudf unit tests (#9472) @karthikeyann
Fix regex handling of embedded null characters (#9470) @davidwendt
Fix memcheck error in copy-if-else (#9467) @davidwendt
Fix bug in dask_cudf.read_parquet for index=False (#9453) @rjzamora
Preserve the decimal scale when creating a default scalar (#9449) @revans2
Push down parent nulls when flattening nested columns. (#9443) @mythrocks
Fix memcheck error in gtest SegmentedGatherTest/GatherSliced (#9442) @davidwendt
Revert "Fix quantile division / partition handling for dask-cudf sort… (#9438) @charlesbluca
Allow int-like objects for the decimals argument in round (#9428) @shwina
Fix stream compaction's drop_duplicates API to use stable sort (#9417) @ttnghia
Skip Comparing Uniform Window Results in Var/std Tests (#9416) @isVoid
Fix StructColumn.to_pandas type handling issues (#9388) @galipremsagar
Correct issues in the build dir cudf-config.cmake (#9386) @robertmaynard
Fix Java table partition test to account for non-deterministic ordering (#9385) @jlowe
Fix timestamp truncation/overflow bugs in orc/parquet (#9382) @PointKernel
Fix the crash in stats code (#9368) @devavret
Make Series.hash_encode results reproducible. (#9366) @bdice
Fix libcudf compile warnings on debug 11.4 build (#9360) @davidwendt
Fail gracefully when compiling python UDFs that attempt to access columns with unsupported dtypes (#9359) @brandon-b-miller
Set pass_filenames: false in mypy pre-commit configuration. (#9349) @bdice
Fix cudf_assert in cudf::io::orc::gpu::gpuDecodeOrcColumnData (#9348) @davidwendt
Fix memcheck error in groupby-tdigest get_scalar_minmax (#9339) @davidwendt
Optimizations for cudf.concat when axis=1 (#9333) @galipremsagar
Use f-string in join helper warning message. (#9325) @bdice
Avoid casting to list or struct dtypes in dask_cudf.read_parquet (#9314) @rjzamora
Fix null count in statistics for parquet (#9303) @devavret
Potential overflow of decimal32 when casting to int64_t (#9287) @codereport
Fix quantile division / partition handling for dask-cudf sort on null dataframes (#9259) @charlesbluca
Updating cudf version also updates rapids cmake branch (#9249) @robertmaynard
Implement one_hot_encoding in libcudf and bind to python (#9229) @isVoid
BUG FIX: CSV Writer ignores the header parameter when no metadata is provided (#8740) @skirui-source

📖 Documentation

Update Documentation to use TYPED_TEST_SUITE (#9654) @codereport
Add dedicated page for StringHandling in python docs (#9624) @galipremsagar
Update docstring of DataFrame.merge (#9572) @galipremsagar
Use raw strings to avoid SyntaxErrors in parsed docstrings. (#9526) @bdice
Add example to docstrings in rolling.apply (#9522) @isVoid
Update help message to escape quotes in ./build.sh --cmake-args. (#9494) @bdice
Improve Python docstring formatting. (#9493) @bdice
Update table of I/O supported types (#9476) @vuule
Document invalid regex patterns as undefined behavior (#9473) @davidwendt
Miscellaneous documentation fixes to cudf (#9471) @galipremsagar
Fix many documentation errors in libcudf. (#9355) @karthikeyann
Fixing SubwordTokenizer docs issue (#9354) @mayankanand007
Improved deprecation warnings. (#9347) @bdice
doc reorder mr, stream to stream, mr (#9308) @karthikeyann
Deprecate method parameters to DataFrame.join, DataFrame.merge. (#9291) @bdice
Added deprecation warning for .label_encoding() (#9289) @mayankanand007

🚀 New Features

Enable Series.divide and DataFrame.divide (#9630) @vyasr
Update bitmask_and and bitmask_or to return a pair of resulting mask and count of unset bits (#9616) @PointKernel
Add handling of mixed numeric types in to_dlpack (#9585) @galipremsagar
Support re.Pattern object for pat arg in str.replace (#9573) @davidwendt
Add JNI for lists::drop_list_duplicates with keys-values input column (#9553) @ttnghia
Support structs column in min, max, argmin and argmax groupby aggregate() and scan() (#9545) @ttnghia
Move libcudacxx to use rapids_cpm and use newer versions (#9539) @robertmaynard
Add scan min/max support for chrono types to libcudf reduction-scan (not groupby scan) (#9518) @davidwendt
Support args= in apply (#9514) @brandon-b-miller
Add groupby scan min/max support for strings values (#9502) @davidwendt
Add list output option to character_ngrams() function (#9499) @davidwendt
More granular column selection in ORC reader (#9496) @vuule
add min_periods, ddof to groupby covariance, & correlation aggregation (#9492) @karthikeyann
Implement Series.datetime.floor (#9488) @skirui-source
Enable linting of CMake files using pre-commit (#9484) @vyasr
Add support for single-line regex anchors ^/$ in contains_re (#9482) @davidwendt
Augment order_by to Accept a List of null_precedence (#9455) @isVoid
Add format API for list column of strings (#9454) @davidwendt
Enable Datetime/Timedelta dtypes in Masked UDFs (#9451) @brandon-b-miller
Add cudf python groupby.diff (#9446) @karthikeyann
Implement lists::stable_sort_lists for stable sorting of elements within each row of lists column (#9425) @ttnghia
add ctest memcheck using cuda-sanitizer (#9414) @karthikeyann
Support Unary Operations in Masked UDF (#9409) @isVoid
Move Several Series Function to Frame (#9394) @isVoid
MD5 Python hash API (#9390) @bdice
Add cudf strings is_title API (#9380) @davidwendt
Enable casting to int64, uint64, and double in AST code. (#9379) @vyasr
Add support for writing ORC with map columns (#9369) @vuule
extract_list_elements() with column_view indices (#9367) @mythrocks
Reimplement lists::drop_list_duplicates for keys-values lists columns (#9345) @ttnghia
Support Python UDFs written in terms of rows (#9343) @brandon-b-miller
JNI: Support nested types in ORC writer (#9334) @firestarman
Optionally nullify out-of-bounds indices in segmented_gather(). (#9318) @mythrocks
Add shallow hash function and shallow equality comparison for column_view (#9312) @karthikeyann
Add CudaMemoryBuffer for cudaMalloc memory using RMM cuda_memory_resource (#9311) @rongou
Add parameters to control row index stride and stripe size in ORC writer (#9310) @vuule
Add na_position param to dask-cudf sort_values (#9264) @charlesbluca
Add ascending parameter for dask-cudf sort_values (#9250) @charlesbluca
New array conversion methods (#9236) @vyasr
Series apply method backed by masked UDFs (#9217) @brandon-b-miller
Grouping by frequency and resampling (#9178) @shwina
Pure-python masked UDFs (#9174) @brandon-b-miller
Add Covariance, Pearson correlation for sort groupby (libcudf) (#9154) @karthikeyann
Add calendrical_month_sequence in c++ and date_range in python (#8886) @shwina

🛠️ Improvements

Followup to PR 9088 comments (#9659) @cwharris
Update cuCollections to version that supports installed libcudacxx (#9633) @robertmaynard
Add 11.5 dev.yml to cudf (#9617) @galipremsagar
Add xfail for parquet reader 11.5 issue (#9612) @galipremsagar
remove deprecated Rmm.initialize method (#9607) @rongou
Use HostColumnVectorCore for child columns in JCudfSerialization.unpackHostColumnVectors (#9596) @sperlingxx
Set RMM pool to a fixed size in JNI (#9583) @rongou
Use nvCOMP for Snappy compression/decompression (#9582) @vuule
Build CUDA version agnostic packages for dask-cudf (#9578) @Ethyling
Fixed tests warning: "TYPED_TEST_CASE is deprecated, please use TYPED_TEST_SUITE" (#9574) @ttnghia
Enable CMake format in CI and fix style (#9570) @vyasr
Add NVTX Start/End Ranges to JNI (#9563) @abellina
Add librdkafka and python-confluent-kafka to dev conda environments s… (#9562) @jdye64
Add offsets_begin/end() to strings_column_view (#9559) @davidwendt
remove alignment options for RMM jni (#9550) @rongou
Add axis parameter passthrough to DataFrame and Series take for pandas API compatibility (#9549) @dantegd
Remove sizeof and standardize on memory_usage (#9544) @vyasr
Adds cudaProfilerStart/cudaProfilerStop in JNI api (#9543) @abellina
Generalize comparison binary operations (#9542) @vyasr
Expose APIs to wrap CUDA or RMM allocations with a Java device buffer instance (#9538) @jlowe
Add scan sum support for duration types to libcudf (#9536) @davidwendt
Force inlining to improve AST performance (#9530) @vyasr
Generalize some more indexed frame methods (#9529) @vyasr
Add Java bindings for rolling window stddev aggregation (#9527) @razajafri
catch rmm::out_of_memory exceptions in jni (#9525) @rongou
Add an overload of make_empty_column with type_id parameter (#9524) @ttnghia
Accelerate conditional inner joins with larger right tables (#9523) @vyasr
Initial pass of generalizing decimal support in cudf python layer (#9517) @galipremsagar
Cleanup for flattening nested columns (#9509) @rwlee
Enable running tests using RMM arena and async memory resources (#9506) @rongou
Remove dependency on six. (#9495) @bdice
Cleanup some libcudf strings gtests (#9489) @davidwendt
Rename strings/array_tests.cu to strings/array_tests.cpp (#9480) @davidwendt
Refactor sorting APIs (#9464) @vyasr
Implement DataFrame.hash_values, deprecate DataFrame.hash_columns. (#9458) @bdice
Deprecate Series.hash_encode. (#9457) @bdice
Update conda recipes for Enhanced Compatibility effort (#9456) @ajschmidt8
Small clean up to simplify column selection code in ORC reader (#9444) @vuule
add missing stream to scalar.is_valid() wherever stream is available (#9436) @karthikeyann
Adds Deprecation Warnings to one_hot_encoding and Implement get_dummies with Cython API (#9435) @isVoid
Update pre-commit hook URLs. (#9433) @bdice
Remove pyarrow import in dask_cudf.io.parquet (#9429) @charlesbluca
Miscellaneous improvements for UDFs (#9422) @isVoid
Use pre-commit for CI (#9412) @vyasr
Update to UCX-Py 0.23 (#9407) @pentschev
Expose OutOfBoundsPolicy in JNI for Table.gather (#9406) @abellina
Improvements to tdigest aggregation code. (#9403) @nvdbaranec
Add Java API to deserialize a table to host columns (#9402) @jlowe
Frame copy to use class instead of type() (#9397) @madsbk
Change all DeprecationWarnings to FutureWarning. (#9392) @bdice
Update Java nvcomp JNI bindings to nvcomp 2.x API (#9384) @jbrennan333
Add IndexedFrame class and move SingleColumnFrame to a separate module (#9378) @vyasr
Support Arrow NativeFile and PythonFile for remote ORC storage (#9377) @rjzamora
Use Arrow PythonFile for remote CSV storage (#9376) @rjzamora
Add multi-threaded writing to GDS writes (#9372) @devavret
Miscellaneous column cleanup (#9370) @vyasr
Use single kernel to extract all groups in cudf::strings::extract (#9358) @davidwendt
Consolidate binary ops into Frame (#9357) @isVoid
Move rank scan implementations from scan_inclusive.cu to rank_scan.cu (#9351) @davidwendt
Remove usage of deprecated thrust::host_space_tag. (#9350) @bdice
Use Default Memory Resource for Temporaries in reduction.cpp (#9344) @isVoid
Fix Cython compilation warnings. (#9327) @bdice
Fix some unused variable warnings in libcudf (#9326) @davidwendt
Use optional-iterator for copy-if-else kernel (#9324) @davidwendt
Remove Table class (#9315) @vyasr
Unpin dask and distributed in CI (#9307) @galipremsagar
Add optional-iterator support to indexalator (#9306) @davidwendt
Consolidate more methods in Frame (#9305) @vyasr
Add Arrow-NativeFile and PythonFile support to read_parquet and read_csv in cudf (#9304) @rjzamora
Pin mypy in .pre-commit-config.yaml to match conda environment pinning. (#9300) @bdice
Use gather.hpp when gather-map exists in device memory (#9299) @davidwendt
Fix Automerger for Branch-21.12 from branch-21.10 (#9285) @galipremsagar
Refactor cuIO timestamp processing with cuda::std::chrono (#9278) @PointKernel
Change strings copy_if_else to use optional-iterator instead of pair-iterator (#9266) @davidwendt
Update cudf java bindings to 21.12.0-SNAPSHOT (#9248) @pxLi
Various internal MultiIndex improvements (#9243) @vyasr
Add detail interface for split and slice(table_view), refactors both function with host_span (#9226) @isVoid
Refactor MD5 implementation. (#9212) @bdice
Update groupby result_cache to allow sharing intermediate results based on column_view instead of requests. (#9195) @karthikeyann
Use nvcomp's snappy decompressor in avro reader (#9181) @devavret
Add isocalendar API support (#9169) @marlenezw
Simplify read_json by removing unnecessary reader/impl classes (#9088) @cwharris
Simplify read_csv by removing unnecessary reader/impl classes (#9041) @cwharris
Refactor hash join with cuCollections multimap (#8934) @PointKernel

- C++
Published by GPUtester over 4 years ago

https://github.com/rapidsai/cudf - v21.10.01

v21.10.01

- C++
Published by GPUtester over 4 years ago

https://github.com/rapidsai/cudf - v21.10.00

🚨 Breaking Changes

Remove Cython APIs for table view generation (#9199) @vyasr
Upgrade pandas version in cudf (#9147) @galipremsagar
Make AST operators nullable (#9096) @vyasr
Remove the option to pass data types as strings to read_csv and read_json (#9079) @vuule
Update JNI java CSV APIs to not use deprecated API (#9066) @revans2
Support additional format specifiers in from_timestamps (#9047) @davidwendt
Expose expression base class publicly and simplify public AST API (#9045) @vyasr
Add support for struct type in ORC writer (#9025) @vuule
Remove aliases of various api.types APIs from utils.dtypes. (#9011) @vyasr
Java bindings for conditional join output sizes (#9002) @jlowe
Move compute_column API out of ast namespace (#8957) @vyasr
cudf.dtype function (#8949) @shwina
Refactor Frame reductions (#8944) @vyasr
Add nested column selection to parquet reader (#8933) @devavret
JNI Aggregation Type Changes (#8919) @revans2
Add groupby_aggregation and groupby_scan_aggregation classes and force their usage. (#8906) @nvdbaranec
Expand CSV and JSON reader APIs to accept dtypes as a vector or map of data_type objects (#8856) @vuule
Change cudf docs theme to pydata theme (#8746) @galipremsagar
Enable compiled binary ops in libcudf, python and java (#8741) @karthikeyann
Make groupby transform-like op order match original data order (#8720) @isVoid

🐛 Bug Fixes

fixed_point cudf::groupby for mean aggregation (#9296) @codereport
Fix interleave_columns when the input string lists column having empty child column (#9292) @ttnghia
Update nvcomp to include fixes for installation of headers (#9276) @devavret
Fix Java column leak in testParquetWriteMap (#9271) @jlowe
Fix call to thrust::reduce_by_key in argmin/argmax libcudf groupby (#9263) @davidwendt
Fixing empty input to getMapValue crashing (#9262) @hyperbolic2346
Fix duplicate names issue in MultiIndex.deserialize (#9258) @galipremsagar
Dataframe.sort_index optimizations (#9238) @galipremsagar
Temporarily disabling problematic test in parquet writer (#9230) @devavret
Explicitly disable groupby on unsupported key types. (#9227) @mythrocks
Fix gather for sliced input structs column (#9218) @ttnghia
Fix JNI code for left semi and anti joins (#9207) @jlowe
Only install thrust when using a non 'system' version (#9206) @robertmaynard
Remove zlib from libcudf public CMake dependencies (#9204) @robertmaynard
Fix out-of-bounds memory read in orc gpuEncodeOrcColumnData (#9196) @davidwendt
Fix gather() for STRUCT inputs with no nulls in members. (#9194) @mythrocks
get_cucollections properly uses rapids_cpm_find (#9189) @robertmaynard
rapids-export correctly reference build code block and doc strings (#9186) @robertmaynard
Fix logic while parsing the sum statistic for numerical orc columns (#9183) @ayushdg
Add handling for nulls in dask_cudf.sorting.quantile_divisions (#9171) @charlesbluca
Approximate overflow detection in ORC statistics (#9163) @vuule
Use decimal precision metadata when reading from parquet files (#9162) @shwina
Fix variable name in Java build script (#9161) @jlowe
Import rapids-cmake modules using the correct cmake variable. (#9149) @robertmaynard
Fix conditional joins with empty left table (#9146) @vyasr
Fix joining on indexes with duplicate level names (#9137) @shwina
Fixes missing child column name in dtype while reading ORC file. (#9134) @rgsl888prabhu
Apply type metadata after column is slice-copied (#9131) @isVoid
Fix a bug: inner_join_size return zero if build table is empty (#9128) @PointKernel
Fix multi hive-partition parquet reading in dask-cudf (#9122) @rjzamora
Support null literals in expressions (#9117) @vyasr
Fix cudf::hash_join output size for struct joins (#9107) @jlowe
Import fix (#9104) @shwina
Fix cudf::strings::is_fixed_point checking of overflow for decimal32 (#9093) @davidwendt
Fix branch_stack calculation in `row_bit_count()` (#9076) @mythrocks
Fetch rapids-cmake to work around cuCollection cmake issue (#9075) @jlowe
Fix compilation errors in groupby benchmarks. (#9072) @nvdbaranec
Preserve float16 upscaling (#9069) @galipremsagar
Fix memcheck read error in libcudf contiguous_split (#9067) @davidwendt
Add support for reading ORC file with no row group index (#9060) @rgsl888prabhu
Various multiindex related fixes (#9036) @shwina
Avoid rebuilding cython in build.sh (#9034) @brandon-b-miller
Add support for percentile dispatch in dask_cudf (#9031) @galipremsagar
cudf resolve nvcc 11.0 compiler crashes during codegen (#9028) @robertmaynard
Fetch correct grouping keys agg of dask groupby (#9022) @galipremsagar
Allow where() to work with a Series and other=cudf.NA (#9019) @sarahyurick
Use correct index when returning Series from GroupBy.apply() (#9016) @charlesbluca
Fix Dataframe indexer setitem when array is passed (#9006) @galipremsagar
Fix ORC reading of files with struct columns that have null values (#9005) @vuule
Ensure JNI native libraries load when CompiledExpression loads (#8997) @jlowe
Fix memory read error in get_dremel_data in page_enc.cu (#8995) @davidwendt
Fix memory write error in get_list_child_to_list_row_mapping utility (#8994) @davidwendt
Fix debug compile error for csv_test.cpp (#8981) @davidwendt
Fix memory read/write error in concatenate_lists_ignore_null (#8978) @davidwendt
Fix concatenation of cudf.RangeIndex (#8970) @galipremsagar
Java conditional joins should not require matching column counts (#8955) @jlowe
Fix concatenate empty structs (#8947) @sperlingxx
Fix cuda-memcheck errors for some libcudf functions (#8941) @davidwendt
Apply series name to result of SeriesGroupby.apply() (#8939) @charlesbluca
cdef packed_columns as cppclass instead of struct (#8936) @charlesbluca
Inserting a cudf.NA into a DataFrame (#8923) @sarahyurick
Support casting with Pandas dtype aliases (#8920) @sarahyurick
Allow sort_values to accept same kind values as Pandas (#8912) @sarahyurick
Enable casting to pandas nullable dtypes (#8889) @brandon-b-miller
Fix libcudf memory errors (#8884) @karthikeyann
Throw KeyError when accessing field from struct with nonexistent key (#8880) @NV-jpt
replace auto with auto& ref for cast<&> (#8866) @karthikeyann
Add missing include<optional> in binops (#8864) @karthikeyann
Fix select_dtypes to work when non-class dtypes present in dataframe (#8849) @sarahyurick
Re-enable JSON tests (#8843) @vuule
Support header with embedded delimiter in csv writer (#8798) @davidwendt

📖 Documentation

Add IO docs page in cudf documentation (#9145) @galipremsagar
use correct namespace in cuio code examples (#9037) @cwharris
Restructuring Contributing doc (#9026) @iskode
Update stable version in readme (#9008) @galipremsagar
Add spans and more include guidelines to libcudf developer guide (#8931) @harrism
Update Java build instructions to mention Arrow S3 and Docker (#8867) @jlowe
List GDS-enabled formats in the docs (#8805) @vuule
Change cudf docs theme to pydata theme (#8746) @galipremsagar

🚀 New Features

Revert "Add shallow hash function and shallow equality comparison for column_view (#9185)" (#9283) @karthikeyann
Align DataFrame.apply signature with pandas (#9275) @brandon-b-miller
Add struct type support for drop_list_duplicates (#9202) @ttnghia
support CUDA async memory resource in JNI (#9201) @rongou
Add shallow hash function and shallow equality comparison for column_view (#9185) @karthikeyann
Superimpose null masks for STRUCT columns. (#9144) @mythrocks
Implemented bindings for ceil timestamp operation (#9141) @shaneding
Adding MAP type support for ORC Reader (#9132) @rgsl888prabhu
Implement interleave_columns for lists with arbitrary nested type (#9130) @ttnghia
Add python bindings to fixed-size window and groupby rolling.var, rolling.std (#9097) @isVoid
Make AST operators nullable (#9096) @vyasr
Java bindings for approx_percentile (#9094) @andygrove
Add dseries.struct.explode (#9086) @isVoid
Add support for BaseIndexer in Rolling APIs (#9085) @galipremsagar
Remove the option to pass data types as strings to read_csv and read_json (#9079) @vuule
Add handling for nested dicts in dask-cudf groupby (#9054) @charlesbluca
Added Series.dt.is_quarter_start and Series.dt.is_quarter_end (#9046) @TravisHester
Support nested types for nth_element reduction (#9043) @sperlingxx
Update sort groupby to use non-atomic operation (#9035) @karthikeyann
Add support for struct type in ORC writer (#9025) @vuule
Implement interleave_columns for structs columns (#9012) @ttnghia
Add groupby first and last aggregations (#9004) @shwina
Add DecimalBaseColumn and move as_decimal_column (#9001) @isVoid
Python/Cython bindings for multibyte_split (#8998) @jdye64
Support scalar months in add_calendrical_months, extends API to INT32 support (#8991) @isVoid
Added Series.dt.is_month_end (#8989) @TravisHester
Support for using tdigests to compute approximate percentiles. (#8983) @nvdbaranec
Support "unflatten" of columns flattened via flatten_nested_columns(): (#8956) @mythrocks
Implement timestamp ceil (#8942) @shaneding
Add nested column selection to parquet reader (#8933) @devavret
Expose conditional join size calculation (#8928) @vyasr
Support Nulls in Timeseries Generator (#8925) @isVoid
Avoid index equality check in _CPackedColumns.from_py_table() (#8917) @charlesbluca
Add dot product binary op (#8909) @charlesbluca
Expose days_in_month function in libcudf and add python bindings (#8892) @isVoid
Series string repeat (#8882) @sarahyurick
Python binding for quarters (#8862) @shaneding
Expand CSV and JSON reader APIs to accept dtypes as a vector or map of data_type objects (#8856) @vuule
Add Java bindings for AST transform (#8846) @jlowe
Series datetime is_month_start (#8844) @sarahyurick
Support bracket syntax for cudf::strings::replace_with_backrefs group index values (#8841) @davidwendt
Support VARIANCE and STD aggregation in rolling op (#8809) @isVoid
Add quarters to libcudf datetime (#8779) @shaneding
Linear Interpolation of nans via cupy (#8767) @brandon-b-miller
Enable compiled binary ops in libcudf, python and java (#8741) @karthikeyann
Make groupby transform-like op order match original data order (#8720) @isVoid
multibyte_split (#8702) @cwharris
Implement JNI for strings::repeat_strings that repeats each string separately by different numbers of times (#8572) @ttnghia

🛠️ Improvements

Pin max dask and distributed versions to 2021.09.1 (#9286) @galipremsagar
Optimized fsspec data transfer for remote file-systems (#9265) @rjzamora
Skip dask-cudf tests on arm64 (#9252) @Ethyling
Use nvcomp's snappy compressor in ORC writer (#9242) @devavret
Only run imports tests on x86_64 (#9241) @Ethyling
Remove unnecessary call to device_uvector::release() (#9237) @harrism
Use nvcomp's snappy decompression in ORC reader (#9235) @devavret
Add grouped_rolling test with STRUCT groupby keys. (#9228) @mythrocks
Optimize cudf.concat for axis=0 (#9222) @galipremsagar
Fix some libcudf calls not passing the stream parameter (#9220) @davidwendt
Add min and max bounds for random dataframe generator numeric types (#9211) @galipremsagar
Improve performance of expression evaluation (#9210) @vyasr
Misc optimizations in cudf (#9203) @galipremsagar
Remove Cython APIs for table view generation (#9199) @vyasr
Add JNI support for drop_list_duplicates (#9198) @revans2
Update pandas versions in conda recipes and requirements.txt files (#9197) @galipremsagar
Minor C++17 cleanup of groupby.cu: structured bindings, more concise lambda, etc (#9193) @codereport
Explicit about bitwidth difference between cudf boolean and arrow boolean (#9192) @isVoid
Remove _source_index from MultiIndex (#9191) @vyasr
Fix typo in the name of cudf-testing-targets.cmake (#9190) @trxcllnt
Add support for single-digits in cudf::to_timestamps (#9173) @davidwendt
Fix cufilejni build include path (#9168) @pxLi
dask_cudf dispatch registering cleanup (#9160) @galipremsagar
Remove unneeded stream/mr from a cudf::make_strings_column (#9148) @davidwendt
Upgrade pandas version in cudf (#9147) @galipremsagar
make data chunk reader return unique_ptr (#9129) @cwharris
Add backend for percentile_lookup dispatch (#9118) @galipremsagar
Refactor implementation of column setitem (#9110) @vyasr
Fix compile warnings found using nvcc 11.4 (#9101) @davidwendt
Update to UCX-Py 0.22 (#9099) @pentschev
Simplify read_avro by removing unnecessary writer/impl classes (#9090) @cwharris
Allowing %f in format to return nanoseconds (#9081) @marlenezw
Java bindings for cudf::hash_join (#9080) @jlowe
Remove stale code in ColumnBase._fill (#9078) @isVoid
Add support for get_group in GroupBy (#9070) @galipremsagar
Remove remaining "support" methods from DataFrame (#9068) @vyasr
Update JNI java CSV APIs to not use deprecated API (#9066) @revans2
Added method to remove null_masks if the column has no nulls (#9061) @razajafri
Consolidate Several Series and Dataframe Methods (#9059) @isVoid
Remove usage of string based set_dtypes for csv & json readers (#9049) @galipremsagar
Remove some debug print statements from gtests (#9048) @davidwendt
Support additional format specifiers in from_timestamps (#9047) @davidwendt
Expose expression base class publicly and simplify public AST API (#9045) @vyasr
move filepath and mmap logic out of json/csv up to functions.cpp (#9040) @cwharris
Refactor Index hierarchy (#9039) @vyasr
cudf now leverages rapids-cmake to reduce CMake boilerplate (#9030) @robertmaynard
Add support for STRUCT input to groupby (#9024) @mythrocks
Refactor Frame scans (#9021) @vyasr
Remove duplicate set_categories code (#9018) @isVoid
Map support for ParquetWriter (#9013) @razajafri
Remove aliases of various api.types APIs from utils.dtypes. (#9011) @vyasr
Java bindings for conditional join output sizes (#9002) @jlowe
Remove _copy_construct factory (#8999) @vyasr
ENH Allow arbitrary CMake config options in build.sh (#8996) @dillon-cullinan
A small optimization for JNI copy column view to column vector (#8985) @revans2
Fix nvcc warnings in ORC writer (#8975) @devavret
Support nested structs in rank and dense rank (#8962) @rwlee
Move compute_column API out of ast namespace (#8957) @vyasr
Series datetime is_year_end and is_year_start (#8954) @marlenezw
Make Java AstNode public (#8953) @jlowe
Replace allocate with device_uvector for subword_tokenize internal tables (#8952) @davidwendt
cudf.dtype function (#8949) @shwina
Refactor Frame reductions (#8944) @vyasr
Add deprecation warning for Series.set_mask API (#8943) @galipremsagar
Move AST evaluator into a separate header (#8930) @vyasr
JNI Aggregation Type Changes (#8919) @revans2
Move template parameter to function parameter in cudf::detail::left_semi_anti_join (#8914) @davidwendt
Upgrade arrow & pyarrow to 5.0.0 (#8908) @galipremsagar
Add groupby_aggregation and groupby_scan_aggregation classes and force their usage. (#8906) @nvdbaranec
Move structs_column_tests.cu to .cpp. (#8902) @mythrocks
Add stream and memory-resource parameters to struct-scalar copy ctor (#8901) @davidwendt
Combine linearizer and ast_plan (#8900) @vyasr
Add Java bindings for conditional join gather maps (#8888) @jlowe
Remove max version pin for dask & distributed on development branch (#8881) @galipremsagar
fix cufilejni build w/ c++17 (#8877) @pxLi
Add struct accessor to dask-cudf (#8874) @NV-jpt
Migrate dask-cudf CudfEngine to leverage ArrowDatasetEngine (#8871) @rjzamora
Add JNI for extract_quarter, add_calendrical_months, and is_leap_year (#8863) @revans2
Change cudf::scalar copy and move constructors to protected (#8857) @davidwendt
Replace is_same<>::value with is_same_v<> (#8852) @codereport
Add min pytorch version to importorskip in pytest (#8851) @galipremsagar
Java bindings for regex replace (#8847) @jlowe
Remove make strings children with null mask (#8830) @davidwendt
Refactor conditional joins (#8815) @vyasr
Small cleanup (unused headers / commented code removals) (#8799) @codereport
ENH Replace gpuci_conda_retry with gpuci_mamba_retry (#8770) @dillon-cullinan
Update cudf java bindings to 21.10.0-SNAPSHOT (#8765) @pxLi
Refactor and improve join benchmarks with nvbench (#8734) @PointKernel
Refactor Python factories and remove usage of Table for libcudf output handling (#8687) @vyasr
Optimize URL Decoding (#8622) @gaohao95
Parquet writer dictionary encoding refactor (#8476) @devavret
Use nvcomp's snappy decompression in parquet reader (#8252) @devavret
Use nvcomp's snappy compressor in parquet writer (#8229) @devavret

- C++
Published by GPUtester over 4 years ago

https://github.com/rapidsai/cudf - v21.08.03

v21.08.03

- C++
Published by GPUtester over 4 years ago

https://github.com/rapidsai/cudf - v21.08.02

v21.08.02

- C++
Published by GPUtester almost 5 years ago

https://github.com/rapidsai/cudf - v21.08.01

v21.08.01

- C++
Published by GPUtester almost 5 years ago

https://github.com/rapidsai/cudf - v21.08.00

🚨 Breaking Changes

Fix a crash in pack() when being handed tables with no columns. (#8697) @nvdbaranec
Remove unused cudf::strings::create_offsets (#8663) @davidwendt
Add delimiter parameter to cudf::strings::capitalize() (#8620) @davidwendt
Change default datetime index resolution to ns to match pandas (#8611) @vyasr
Add sequence_type parameter to cudf::strings::title function (#8602) @davidwendt
Add strings::repeat_strings API that can repeat each string a different number of times (#8561) @ttnghia
String-to-boolean conversion is different from Pandas (#8549) @skirui-source
Add accurate hash join size functions (#8453) @PointKernel
Expose a Decimal32Dtype in cuDF Python (#8438) @skirui-source
Update dask make_meta changes to be compatible with dask upstream (#8426) @galipremsagar
Adapt cudf::scalar classes to changes in rmm::device_scalar (#8411) @harrism
Remove special Index class from the general index class hierarchy (#8309) @vyasr
Add first-class dtype utilities (#8308) @vyasr
ORC - Support reading multiple orc files/buffers in a single operation (#8142) @jdye64
Upgrade arrow to 4.0.1 (#7495) @galipremsagar

🐛 Bug Fixes

Fix contains check in string column (#8834) @galipremsagar
Remove unused variable from row_bit_count_test. (#8829) @mythrocks
Fixes issue with null struct columns in ORC reader (#8819) @rgsl888prabhu
Set CMake vars for python/parquet support in libarrow builds (#8808) @vyasr
Handle empty child columns in row_bit_count() (#8791) @mythrocks
Revert "Remove cudf unneeded build time requirement of the cuda driver" (#8784) @robertmaynard
Fix isort error in utils.pyx (#8771) @charlesbluca
Handle sliced struct/list columns properly in concatenate() bounds checking. (#8760) @nvdbaranec
Fix issues with _CPackedColumns.serialize() handling of host and device data (#8759) @charlesbluca
Fix issues with MultiIndex in dropna, stack & reset_index (#8753) @galipremsagar
Write pandas extension types to parquet file metadata (#8749) @devavret
Fix where to handle DataFrame & Series input combination (#8747) @galipremsagar
Fix replace to handle null values correctly (#8744) @galipremsagar
Handle sliced structs properly in pack/contiguous_split. (#8739) @nvdbaranec
Fix issue in slice() where columns with a positive offset were computing null counts incorrectly. (#8738) @nvdbaranec
Fix cudf.Series constructor to handle list of sequences (#8735) @galipremsagar
Fix min/max sorted groupby aggregation on string column with nulls (argmin, argmax sentinel value missing on nulls) (#8731) @karthikeyann
Fix orc reader assert on create data_type in debug (#8706) @davidwendt
Fix min/max inclusive cudf::scan for strings column (#8705) @davidwendt
JNI: Fix driver version assertion logic in testGetCudaRuntimeInfo (#8701) @sperlingxx
Adding fix for skip_rows and crash in orc reader (#8700) @rgsl888prabhu
Bug fix: replace_nulls_policy functor not returning correct indices for gathermap (#8699) @isVoid
Fix a crash in pack() when being handed tables with no columns. (#8697) @nvdbaranec
Add post-processing steps to dask_cudf.groupby.CudfSeriesGroupby.aggregate (#8694) @charlesbluca
JNI build no longer looks for Arrow in conda environment (#8686) @jlowe
Handle arbitrarily different data in null list column rows when checking for equivalency. (#8666) @nvdbaranec
Add ConfigureNVBench to avoid concurrent main() entry points (#8662) @PointKernel
Pin *arrow to use *cuda in run (#8651) @jakirkham
Add proper support for tolerances in testing methods. (#8649) @vyasr
Support multi-char case conversion in capitalize function (#8647) @davidwendt
Fix repeated mangled names in read_csv with duplicate column names (#8645) @karthikeyann
Temporarily disable libcudf example build tests (#8642) @isVoid
Use conda-sourced cudf artifacts for libcudf example in CI (#8638) @isVoid
Ensure dev environment uses Arrow GPU packages (#8637) @charlesbluca
Fix bug that columns only initialized once when specified columns and index in dataframe ctor (#8628) @isVoid
Propagate **kwargs through to as_*_column methods (#8618) @shwina
Fix orc_reader_benchmark.cpp compile error (#8609) @davidwendt
Fix missed renumbering of Aggregation values (#8600) @revans2
Update cmake to 3.20.5 in the Java Docker image (#8593) @NvTimLiu
Fix bug in replace_with_backrefs when group has greedy quantifier (#8575) @davidwendt
Apply metadata to keys before returning in Frame._encode (#8560) @charlesbluca
Fix for strings containing special JSON characters in get_json_object(). (#8556) @nvdbaranec
Fix debug compile error in gather_struct_tests.cpp (#8554) @davidwendt
String-to-boolean conversion is different from Pandas (#8549) @skirui-source
Fix __repr__ output with display.max_rows is None (#8547) @galipremsagar
Fix size passed to column constructors in _with_type_metadata (#8539) @shwina
Properly retrieve last column when -1 is specified for column index (#8529) @isVoid
Fix importing apply from dask (#8517) @galipremsagar
Fix offset of the string dictionary length stream (#8515) @vuule
Fix double counting of selected columns in CSV reader (#8508) @ochan1
Incorrect map size in scatter_to_gather corrupts struct columns (#8507) @gerashegalov
replace_nulls properly propagates memory resource to gather calls (#8500) @robertmaynard
Disallow groupby aggs for StructColumns (#8499) @charlesbluca
Fixes out-of-bounds access for small files in unzip (#8498) @elstehle
Adding support for writing empty dataframe (#8490) @shaneding
Fix exclusive scan when including nulls and improve testing (#8478) @harrism
Add workaround for crash in libcudf debug build using output_indexalator in thrust::lower_bound (#8432) @davidwendt
Install only the same Thrust files that Thrust itself installs (#8420) @robertmaynard
Add nightly version for ucx-py in ci script (#8419) @galipremsagar
Fix null_equality config of rolling_collect_set (#8415) @sperlingxx
CollectSetAggregation: implement RollingAggregation interface (#8406) @sperlingxx
Handle pre-sliced nested columns in contiguous_split. (#8391) @nvdbaranec
Fix bitmask_tests.cpp host accessing device memory (#8370) @davidwendt
Fix concurrent_unordered_map to prevent accessing padding bits in pair_type (#8348) @davidwendt
BUG FIX: Raise appropriate strings error when concatenating strings column (#8290) @skirui-source
Make gpuCI and pre-commit style configurations consistent (#8215) @charlesbluca
Add collect list to dask-cudf groupby aggregations (#8045) @charlesbluca

📖 Documentation

Update Python UDFs notebook (#8810) @brandon-b-miller
Fix dask.dataframe API docs links after reorg (#8772) @jsignell
Fix instructions for running cuDF/dask-cuDF tests in CONTRIBUTING.md (#8724) @shwina
Translate Markdown documentation to rST and remove recommonmark (#8698) @vyasr
Fixed spelling mistakes in libcudf documentation (#8664) @karthikeyann
Custom Sphinx Extension: PandasCompat (#8643) @isVoid
Fix README.md (#8535) @ajschmidt8
Change namespace contains_nulls to struct (#8523) @davidwendt
Add info about NVTX ranges to dev guide (#8461) @jrhemstad
Fixed documentation bug in groupby agg method (#8325) @ahmet-uyar

🚀 New Features

Fix concatenating structs (#8811) @shaneding
Implement JNI for groupby aggregations M2 and MERGE_M2 (#8763) @ttnghia
Bump isort to 5.6.4 and remove isort overrides made for 5.0.7 (#8755) @charlesbluca
Implement __setitem__ for StructColumn (#8737) @shaneding
Add is_leap_year to DateTimeProperties and DatetimeIndex (#8736) @isVoid
Add struct.explode() method (#8729) @shwina
Add DataFrame.to_struct() method to convert a DataFrame to a struct Series (#8728) @shwina
Add support for list type in ORC writer (#8723) @vuule
Fix slicing from struct columns and accessing struct columns (#8719) @shaneding
Add datetime::is_leap_year (#8711) @isVoid
Accessing struct columns from dask_cudf (#8675) @shaneding
Added pct_change to Series (#8650) @TravisHester
Add strings support to cudf::shift function (#8648) @davidwendt
Support Scatter struct_scalar (#8630) @isVoid
Struct scalar from host dictionary (#8629) @shaneding
Add dayofyear and day_of_year to Series, DatetimeColumn, and DatetimeIndex (#8626) @beckernick
JNI support for capitalize (#8624) @firestarman
Add delimiter parameter to cudf::strings::capitalize() (#8620) @davidwendt
Add NVBench in CMake (#8619) @PointKernel
Change default datetime index resolution to ns to match pandas (#8611) @vyasr
ListColumn __setitem__ (#8606) @brandon-b-miller
Implement groupby aggregations M2 and MERGE_M2 (#8605) @ttnghia
Add sequence_type parameter to cudf::strings::title function (#8602) @davidwendt
Adding support for list and struct type in ORC Reader (#8599) @rgsl888prabhu
Benchmark for strings::repeat_strings APIs (#8589) @ttnghia
Nested scalar support for copy if else (#8588) @gerashegalov
User specified decimal columns to float64 (#8587) @jdye64
Add get_element for struct column (#8578) @isVoid
Python changes for adding __getitem__ for struct (#8577) @shaneding
Add strings::repeat_strings API that can repeat each string a different number of times (#8561) @ttnghia
Refactor tests/iterator_utilities.hpp functions (#8540) @ttnghia
Support MERGE_LISTS and MERGE_SETS in Java package (#8516) @sperlingxx
Decimal support csv reader (#8511) @elstehle
Add column type tests (#8505) @isVoid
Warn when downscaling decimal columns (#8492) @ChrisJar
Add JNI for strings::repeat_strings (#8491) @ttnghia
Add Index.get_loc for Numerical, String Index support (#8489) @isVoid
Expose half_up rounding in cuDF (#8477) @shwina
Java APIs to fetch CUDA runtime info (#8465) @sperlingxx
Add str.edit_distance_matrix (#8463) @isVoid
Support constructing cudf.Scalar objects from host side lists (#8459) @brandon-b-miller
Add accurate hash join size functions (#8453) @PointKernel
Add cudf::strings::integer_to_hex convert API (#8450) @davidwendt
Create objects from iterables that contain cudf.NA (#8442) @brandon-b-miller
JNI bindings for sort_lists (#8439) @sperlingxx
Expose a Decimal32Dtype in cuDF Python (#8438) @skirui-source
Replace all_null() and all_valid() by iterator_all_nulls() and iterator_no_null() in tests (#8437) @ttnghia
Implement groupby MERGE_LISTS and MERGE_SETS aggregates (#8436) @ttnghia
Add public libcudf match_dictionaries API (#8429) @davidwendt
Add move constructors for string_scalar and struct_scalar (#8428) @ttnghia
Implement strings::repeat_strings (#8423) @ttnghia
STRUCT column support for cudf::merge. (#8422) @nvdbaranec
Implement reverse in libcudf (#8410) @shaneding
Support multiple input files/buffers for read_json (#8403) @jdye64
Improve test coverage for struct search (#8396) @ttnghia
Add groupby.fillna (#8362) @isVoid
Enable AST-based joining (#8214) @vyasr
Generalized null support in user defined functions (#8213) @brandon-b-miller
Add compiled binary operation (#8192) @karthikeyann
Implement .describe() for DataFrameGroupBy (#8179) @skirui-source
ORC - Support reading multiple orc files/buffers in a single operation (#8142) @jdye64
Add Python bindings for lists::concatenate_list_elements and expose them as .list.concat() (#8006) @shwina
Use Arrow URI FileSystem backed instance to retrieve remote files (#7709) @jdye64
Example to build custom application and link to libcudf (#7671) @isVoid
Upgrade arrow to 4.0.1 (#7495) @galipremsagar

🛠️ Improvements

Provide a better error message when CUDA::cuda_driver not found (#8794) @robertmaynard
Remove anonymous namespace from null_mask.cuh (#8786) @nvdbaranec
Allow cudf to be built without libcuda.so existing (#8751) @robertmaynard
Pin mimesis to <4.1 (#8745) @galipremsagar
Update conda environment name for CI (#8692) @ajschmidt8
Remove flatbuffers dependency (#8671) @Ethyling
Add options to build Arrow with Python and Parquet support (#8670) @trxcllnt
Remove unused cudf::strings::create_offsets (#8663) @davidwendt
Update GDS lib version to 1.0.0 (#8654) @pxLi
Support for groupby/scan rank and dense_rank aggregations (#8652) @rwlee
Fix usage of deprecated arrow ipc API (#8632) @revans2
Use absolute imports in cudf (#8631) @galipremsagar
ENH Add Java CI build script (#8627) @dillon-cullinan
Add DeprecationWarning to ser.str.subword_tokenize (#8603) @VibhuJawa
Rewrite binary operations for improved performance and additional type support (#8598) @vyasr
Fix mypy errors surfacing because of numpy-1.21.0 (#8595) @galipremsagar
Remove unneeded includes from cudf::string_view headers (#8594) @davidwendt
Use cmake 3.20.1 as it is now required by rmm (#8586) @robertmaynard
Remove device debug symbols from cmake CUDF_CUDA_FLAGS (#8584) @davidwendt
Dask-CuDF: use default Dask Dataframe optimizer (#8581) @madsbk
Remove checking if an unsigned value is less than zero (#8579) @robertmaynard
Remove strings_count parameter from cudf::strings::detail::create_chars_child_column (#8576) @davidwendt
Make cudf.api.types imports consistent (#8571) @galipremsagar
Modernize libcudf basic example CMakeFile; updates CI build tests (#8568) @isVoid
Rename concatenate_tests.cu to .cpp (#8555) @davidwendt
enable window lead/lag test on struct (#8548) @wbo4958
Add Java methods to split and write column views (#8546) @razajafri
Small cleanup (#8534) @codereport
Unpin dask version in CI (#8533) @galipremsagar
Added optional flag for building Arrow with S3 filesystem support (#8531) @jdye64
Minor clean up of various internal column and frame utilities (#8528) @vyasr
Rename some copying_test source files .cu to .cpp (#8527) @davidwendt
Correct the last warnings and issues when using newer cuda versions (#8525) @robertmaynard
Correct unused parameter warnings in transform and unary ops (#8521) @robertmaynard
Correct unused parameter warnings in string algorithms (#8509) @robertmaynard
Add in JNI APIs for scan, replace_nulls, groupby.scan, and groupby.replace_nulls (#8503) @revans2
Fix 21.08 forward-merge conflicts (#8502) @ajschmidt8
Fix Cython formatting command in Contributing.md. (#8496) @marlenezw
Bug/correct unused parameters in reshape and text (#8495) @robertmaynard
Correct unused parameter warnings in partitioning and stream compact (#8494) @robertmaynard
Correct unused parameter warnings in labelling and list algorithms (#8493) @robertmaynard
Refactor index construction (#8485) @vyasr
Correct unused parameter warnings in replace algorithms (#8483) @robertmaynard
Correct unused parameter warnings in reduction algorithms (#8481) @robertmaynard
Correct unused parameter warnings in io algorithms (#8480) @robertmaynard
Correct unused parameter warnings in interop algorithms (#8479) @robertmaynard
Correct unused parameter warnings in filling algorithms (#8468) @robertmaynard
Correct unused parameter warnings in groupby (#8467) @robertmaynard
use libcu++ time_point as timestamp (#8466) @karthikeyann
Modify reprog_device::extract to return groups in a single pass (#8460) @davidwendt
Update minimum Dask requirement to 2021.6.0 (#8458) @pentschev
Fix failures when performing binary operations on DataFrames with empty columns (#8452) @ChrisJar
Fix conflicts in 8447 (#8448) @ajschmidt8
Add serialization methods for List and StructDtype (#8441) @charlesbluca
Replace make_empty_strings_column with make_empty_column (#8435) @davidwendt
JNI bindings for get_element (#8433) @revans2
Update dask make_meta changes to be compatible with dask upstream (#8426) @galipremsagar
Unpin dask version on CI (#8425) @galipremsagar
Add benchmark for strings/fixed_point convert APIs (#8417) @davidwendt
Adapt cudf::scalar classes to changes in rmm::device_scalar (#8411) @harrism
Add benchmark for strings/integers convert APIs (#8402) @davidwendt
Enable multi-file partitioning in dask_cudf.read_parquet (#8393) @rjzamora
Correct unused parameter warnings in rolling algorithms (#8390) @robertmaynard
Correct unused parameters in column round and search (#8389) @robertmaynard
Add functionality to apply Dtype metadata to ColumnBase (#8373) @charlesbluca
Refactor setting stack size in regex code (#8358) @davidwendt
Update Java bindings to 21.08-SNAPSHOT (#8344) @pxLi
Replace remaining uses of device_vector (#8343) @harrism
Statically link libnvcomp into libcudfjni (#8334) @jlowe
Resolve auto merge conflicts for Branch 21.08 from branch 21.06 (#8329) @galipremsagar
Minor code refactor for sorted_order (#8326) @wbo4958
Remove special Index class from the general index class hierarchy (#8309) @vyasr
Add first-class dtype utilities (#8308) @vyasr
Add option to link Java bindings with Arrow dynamically (#8307) @jlowe
Refactor ColumnMethods and its subclasses to remove column argument and require parent argument (#8306) @shwina
Refactor scatter for list columns (#8255) @isVoid
Expose pack/unpack API to Python (#8153) @charlesbluca
Adding cudf.cut method (#8002) @marlenezw
Optimize string gather performance for large strings (#7980) @gaohao95
Add peak memory usage tracking to cuIO benchmarks (#7770) @devavret
Updating Clang Version to 11.0.0 (#6695) @codereport

- C++
Published by GPUtester almost 5 years ago

https://github.com/rapidsai/cudf - v21.06.01

- C++
Published by GPUtester almost 5 years ago

https://github.com/rapidsai/cudf - v21.06.00

🚨 Breaking Changes

Add support for make_meta_obj dispatch in dask-cudf (#8342) @galipremsagar
Add separator-on-null parameter to strings concatenate APIs (#8282) @davidwendt
Introduce a common parent class for NumericalColumn and DecimalColumn (#8278) @vyasr
Update ORC statistics API to use C++17 standard library (#8241) @vuule
Preserve column hierarchy when getting NULL row from LIST column (#8206) @isVoid
Groupby.shift c++ API refactor and python binding (#8131) @isVoid

🐛 Bug Fixes

Fix struct flattening to add a validity column only when the input column has null element (#8374) @ttnghia
Compilation fix: Remove redefinition for std::is_same_v() (#8369) @mythrocks
Add backward compatibility for dask-cudf to work with other versions of dask (#8368) @galipremsagar
Handle empty results with nested types in copy_if_else (#8359) @nvdbaranec
Handle nested column types properly for empty parquet files. (#8350) @nvdbaranec
Raise error when unsupported arguments are passed to dask_cudf.DataFrame.sort_values (#8349) @galipremsagar
Raise NotImplementedError for axis=1 in rank (#8347) @galipremsagar
Add support for make_meta_obj dispatch in dask-cudf (#8342) @galipremsagar
Update Java string concatenate test for single column (#8330) @tgravescs
Use empty_like in scatter (#8314) @revans2
Fix concatenate_lists_ignore_null on rows of all_nulls (#8312) @sperlingxx
Add separator-on-null parameter to strings concatenate APIs (#8282) @davidwendt
COLLECT_LIST support returning empty output columns. (#8279) @mythrocks
Update io util to convert path like object to string (#8275) @ayushdg
Fix result column types for empty inputs to rolling window (#8274) @mythrocks
Actually test equality in assert_groupby_results_equal (#8272) @shwina
CMake always explicitly specify a source files extension (#8270) @robertmaynard
Fix struct binary search and struct flattening (#8268) @ttnghia
Revert "patch thrust to fix intmax num elements limitation in scan_by_key" (#8263) @cwharris
upgrade dlpack to 0.5 (#8262) @cwharris
Fixes CSV-reader type inference for thousands separator and decimal point (#8261) @elstehle
Fix incorrect assertion in Java concat (#8258) @sperlingxx
Copy nested types upon construction (#8244) @isVoid
Preserve column hierarchy when getting NULL row from LIST column (#8206) @isVoid
Clip decimal binary op precision at max precision (#8194) @ChrisJar

📖 Documentation

Add docstring for dask_cudf.read_csv (#8355) @galipremsagar
Fix cudf release version in readme (#8331) @galipremsagar
Fix structs column description in dev docs (#8318) @isVoid
Update readme with correct CUDA versions (#8315) @raydouglass
Add description of the cuIO GDS integration (#8293) @vuule
Remove unused parameter from copy_partition kernel documentation (#8283) @robertmaynard

🚀 New Features

Add support merging b/w categorical data (#8332) @galipremsagar
Java: Support struct scalar (#8327) @sperlingxx
added _is_homogeneous property (#8299) @shaneding
Added decimal writing for CSV writer (#8296) @kaatish
Java: Support creating a scalar from utf8 string (#8294) @firestarman
Add Java API for Concatenate strings with separator (#8289) @tgravescs
strings::join_list_elements options for empty list inputs (#8285) @ttnghia
Return python lists for getitem calls to list type series (#8265) @brandon-b-miller
add unit tests for lead/lag on list for row window (#8259) @wbo4958
Create a String column from UTF8 String byte arrays (#8257) @firestarman
Support scattering list_scalar (#8256) @isVoid
Implement lists::concatenate_list_elements (#8231) @ttnghia
Support for struct scalars. (#8220) @nvdbaranec
Add support for decimal types in ORC writer (#8198) @vuule
Support create lists column from a list_scalar (#8185) @isVoid
Groupby.shift c++ API refactor and python binding (#8131) @isVoid
Add groupby::replace_nulls(replace_policy) api (#7118) @isVoid

🛠️ Improvements

Support Dask + Distributed 2021.05.1 (#8392) @jakirkham
Add aliases for string methods (#8353) @shwina
Update environment variable used to determine cuda_version (#8321) @ajschmidt8
JNI: Refactor the code of making column from scalar (#8310) @firestarman
Update CHANGELOG.md links for calver (#8303) @ajschmidt8
Merge branch-0.19 into branch-21.06 (#8302) @ajschmidt8
use address and length for GDS reads/writes (#8301) @rongou
Update cudfjni version to 21.06.0 (#8292) @pxLi
Update docs build script (#8284) @ajschmidt8
Make device_buffer streams explicit and enforce move construction (#8280) @harrism
Introduce a common parent class for NumericalColumn and DecimalColumn (#8278) @vyasr
Do not add nulls to the hash table when null_equality::NOT_EQUAL is passed to left_semi_join and left_anti_join (#8277) @nvdbaranec
Enable implicit casting when concatenating mixed types (#8276) @ChrisJar
Fix CMake FindPackage rmm, pin dev envs' dlpack to v0.3 (#8271) @trxcllnt
Update cudfjni version to 21.06 (#8267) @pxLi
support RMM aligned resource adapter in JNI (#8266) @rongou
Pass compiler environment variables to conda python build (#8260) @Ethyling
Remove abc inheritance from Serializable (#8254) @vyasr
Move more methods into SingleColumnFrame (#8253) @vyasr
Update ORC statistics API to use C++17 standard library (#8241) @vuule
Correct unused parameter warnings in dictionary algorithms (#8239) @robertmaynard
Correct unused parameters in the copying algorithms (#8232) @robertmaynard
IO statistics cleanup (#8191) @kaatish
Refactor of rolling_window implementation. (#8158) @nvdbaranec
Add a flag for allowing single quotes in JSON strings. (#8144) @nvdbaranec
Column refactoring 2 (#8130) @vyasr
support space in workspace (#7956) @jolorunyomi
Support collect_set on rolling window (#7881) @sperlingxx

- C++
Published by GPUtester almost 5 years ago

https://github.com/rapidsai/cudf - v0.19.2

🚨 Breaking Changes

Allow hash_partition to take a seed value (#7771) @magnatelee
Allow merging index column with data column using keyword "on" (#7736) @skirui-source
Change JNI API to avoid loading native dependencies when creating sort order classes. (#7729) @revans2
Replace device_vector with device_uvector in null_mask (#7715) @harrism
Don't identify decimals as strings. (#7710) @vyasr
Fix Java Parquet write after writer API changes (#7655) @revans2
Convert cudf::concatenate APIs to use spans and device_uvector (#7621) @harrism
Update missing docstring examples in python public APIs (#7546) @galipremsagar
Remove unneeded step parameter from strings::detail::copy_slice (#7525) @davidwendt
Rename ARROW_STATIC_LIB because it conflicts with one in FindArrow.cmake (#7518) @trxcllnt
Match Pandas logic for comparing two objects with nulls (#7490) @brandon-b-miller
Add struct support to parquet writer (#7461) @devavret
Join APIs that return gathermaps (#7454) @shwina
fixed_point + cudf::binary_operation API Changes (#7435) @codereport
Fix BUG: Exception when PYTHONOPTIMIZE=2 (#7434) @skirui-source
Change nvtext::load_vocabulary_file to return a unique ptr (#7424) @davidwendt
Refactor strings column factories (#7397) @harrism
Use CMAKE_CUDA_ARCHITECTURES (#7391) @robertmaynard
Upgrade pandas to 1.2 (#7375) @galipremsagar
Rename logical_cast to bit_cast and allow additional conversions (#7373) @ttnghia
Rework libcudf CMakeLists.txt to export targets for CPM (#7107) @trxcllnt

🐛 Bug Fixes

unsnap: busy wait a number of cycles (#8073) @vuule
Fix returned column type when extracting from an empty list column (#8031) @jlowe
Don't reindex a new value on setitem if the original dataframe was empty (#8026) @vyasr
Fix a NameError in meta dispatch API (#7996) @galipremsagar
Reindex in DataFrame.__setitem__ (#7957) @galipremsagar
jitify direct-to-cubin compilation and caching. (#7919) @cwharris
Use dynamic cudart for nvcomp in java build (#7896) @abellina
fix "incompatible redefinition" warnings (#7894) @cwharris
cudf consistently specifies the cuda runtime (#7887) @robertmaynard
disable verbose output for jitify_preprocess (#7886) @cwharris
CMake jit_preprocess_files function only runs when needed (#7872) @robertmaynard
Push DeviceScalar construction into cython for list.contains (#7864) @brandon-b-miller
cudf now sets an install rpath of $ORIGIN (#7863) @robertmaynard
Don't install Thrust examples, tests, docs, and python files (#7811) @robertmaynard
Sort by index in groupby tests more consistently (#7802) @shwina
Revert "Update conda recipes pinning of repo dependencies (#7743)" (#7793) @raydouglass
Add decimal column handling in copy_type_metadata (#7788) @shwina
Add column names validation in parquet writer (#7786) @galipremsagar
Fix Java explode outer unit tests (#7782) @jlowe
Fix compiler warning about non-POD types passed through ellipsis (#7781) @jrhemstad
User resource fix for replace_nulls (#7769) @magnatelee
Fix type dispatch for columnar replace_nulls (#7768) @jlowe
Add ignore_order parameter to dask-cudf concat dispatch (#7765) @galipremsagar
Fix slicing and arrow representations of decimal columns (#7755) @vyasr
Fixing issue with explode_outer position not nulling position entries of null rows (#7754) @hyperbolic2346
Implement scatter for struct columns (#7752) @ttnghia
Fix data corruption in string columns (#7746) @galipremsagar
Fix string length in stripe dictionary building (#7744) @kaatish
Update conda recipes pinning of repo dependencies (#7743) @mike-wendt
Enable dask dispatch to cuDF's is_categorical_dtype for cuDF objects (#7740) @brandon-b-miller
Fix dictionary size computation in ORC writer (#7737) @vuule
Fix cudf::cast overflow for decimal64 to int32_t or smaller in certain cases (#7733) @codereport
Change JNI API to avoid loading native dependencies when creating sort order classes. (#7729) @revans2
Disable column_view data accessors for unsupported types (#7725) @jrhemstad
Materialize RangeIndex when index=True in parquet writer (#7711) @galipremsagar
Don't identify decimals as strings. (#7710) @vyasr
Fix return type of DataFrame.argsort (#7706) @galipremsagar
Fix/correct cudf installed package requirements (#7688) @robertmaynard
Fix SparkMurmurHash3_32 hash inconsistencies with Apache Spark (#7672) @jlowe
Fix ORC reader issue with reading empty string columns (#7656) @rgsl888prabhu
Fix Java Parquet write after writer API changes (#7655) @revans2
Fixing empty null lists throwing explode_outer for a loop. (#7649) @hyperbolic2346
Fix internal compiler error during JNI Docker build (#7645) @jlowe
Fix Debug build break with device_uvectors in grouped_rolling.cu (#7633) @mythrocks
Parquet reader: Fix issue when using skip_rows on non-nested columns containing nulls (#7627) @nvdbaranec
Fix ORC reader for empty DataFrame/Table (#7624) @rgsl888prabhu
Fix specifying GPU architecture in JNI build (#7612) @jlowe
Fix ORC writer OOM issue (#7605) @vuule
Fix 0.18 --> 0.19 automerge (#7589) @kkraus14
Fix ORC issue with incorrect timestamp nanosecond values (#7581) @vuule
Fix missing Dask imports (#7580) @kkraus14
CMAKE_CUDA_ARCHITECTURES doesn't change when build-system invokes cmake (#7579) @robertmaynard
Another fix for offsets_end() iterator in lists_column_view (#7575) @ttnghia
Fix ORC writer output corruption with string columns (#7565) @vuule
Fix cudf::lists::sort_lists failing for sliced column (#7564) @ttnghia
FIX Fix Anaconda upload args (#7558) @dillon-cullinan
Fix index mismatch issue in equality related APIs (#7555) @galipremsagar
FIX Revert gpuci_conda_retry on conda file output locations (#7552) @dillon-cullinan
Fix offset_end iterator for lists_column_view, which was not correctl… (#7551) @ttnghia
Fix no such file dlpack.h error when build libcudf (#7549) @chenrui17
Update missing docstring examples in python public APIs (#7546) @galipremsagar
Decimal32 Build Fix (#7544) @razajafri
FIX Retry conda output location (#7540) @dillon-cullinan
fix missing renames of dask git branches from master to main (#7535) @kkraus14
Remove detail from device_span (#7533) @rwlee
Change dask and distributed branch to main (#7532) @dantegd
Update JNI build to use CUDF_USE_ARROW_STATIC (#7526) @jlowe
Make sure rmm::rmm CMake target is visible to cudf users (#7524) @robertmaynard
Fix contiguous_split not properly handling output partitions > 2 GB. (#7515) @nvdbaranec
Change jit launch to safe_launch (#7510) @devavret
Fix comparison between Datetime/Timedelta columns and NULL scalars (#7504) @brandon-b-miller
Fix off-by-one error in char-parallel string scalar replace (#7502) @jlowe
Fix JNI deprecation of all, put it on the wrong version before (#7501) @revans2
Fix Series/Dataframe Mixed Arithmetic (#7491) @brandon-b-miller
Fix JNI build after removal of libcudf sub-libraries (#7486) @jlowe
Correctly compile benchmarks (#7485) @robertmaynard
Fix bool column corruption with ORC Reader (#7483) @rgsl888prabhu
Fix __repr__ for categorical dtype (#7476) @galipremsagar
Java cleaner synchronization (#7474) @abellina
Fix java float/double parsing tests (#7473) @revans2
Pass stream and user resource to make_default_constructed_scalar (#7469) @magnatelee
Improve stability of dask_cudf.DataFrame.var and dask_cudf.DataFrame.std (#7453) @rjzamora
Missing device_storage_dispatch change affecting cudf::gather (#7449) @codereport
fix cuFile JNI compile errors (#7445) @rongou
Support Series.__setitem__ with key to a new row (#7443) @isVoid
Fix BUG: Exception when PYTHONOPTIMIZE=2 (#7434) @skirui-source
Make inclusive scan safe for cases with leading nulls (#7432) @magnatelee
Fix typo in list_device_view::pair_rep_end() (#7423) @mythrocks
Fix string to double conversion and row equivalent comparison (#7410) @ttnghia
Fix thrust failure when transferring data from device_vector to host_vector with vectors of size 1 (#7382) @ttnghia
Fix std::exception catch-by-reference gcc9 compile error (#7380) @davidwendt
Fix skiprows issue with ORC Reader (#7359) @rgsl888prabhu
fix Arrow CMake file (#7358) @rongou
Fix lists::contains() for NaN and Decimals (#7349) @mythrocks
Handle cupy array in Dataframe.__setitem__ (#7340) @galipremsagar
Fix invalid-device-fn error in cudf::strings::replace_re with multiple regex's (#7336) @davidwendt
FIX Add codecov upload block to gpu script (#6860) @dillon-cullinan

📖 Documentation

Fix join API doxygen (#7890) @shwina
Add Resources to README. (#7697) @bdice
Add isin examples in Docstring (#7479) @galipremsagar
Resolving unlinked type shorthands in cudf doc (#7416) @isVoid
Fix typo in regex.md doc page (#7363) @davidwendt
Fix incorrect strings_column_view::chars_size documentation (#7360) @jlowe

🚀 New Features

Enable basic reductions for decimal columns (#7776) @ChrisJar
Enable join on decimal columns (#7764) @ChrisJar
Allow merging index column with data column using keyword "on" (#7736) @skirui-source
Implement DecimalColumn + Scalar and add cudf.Scalars of Decimal64Dtype (#7732) @brandon-b-miller
Add support for unique groupby aggregation (#7726) @shwina
Expose libcudf's label_bins function to cudf (#7724) @vyasr
Adding support for equi-join on struct (#7720) @hyperbolic2346
Add decimal column comparison operations (#7716) @isVoid
Implement scan operations for decimal columns (#7707) @ChrisJar
Enable typecasting between decimal and int (#7691) @ChrisJar
Enable decimal support in parquet writer (#7673) @devavret
Adds list.unique API (#7664) @isVoid
Fix NaN handling in drop_list_duplicates (#7662) @ttnghia
Add lists.sort_values API (#7657) @isVoid
Add is_integer API that can check for the validity of a string-to-integer conversion (#7642) @ttnghia
Adds explode API (#7607) @isVoid
Adds list.take, python binding for cudf::lists::segmented_gather (#7591) @isVoid
Implement cudf::label_bins() (#7554) @vyasr
Add Python bindings for lists::contains (#7547) @skirui-source
cudf::row_bit_count() support. (#7534) @nvdbaranec
Implement drop_list_duplicates (#7528) @ttnghia
Add Python bindings for lists::extract_lists_element (#7505) @skirui-source
Add explode_outer and explode_outer_position (#7499) @hyperbolic2346
Match Pandas logic for comparing two objects with nulls (#7490) @brandon-b-miller
Add struct support to parquet writer (#7461) @devavret
Enable type conversion from float to decimal type (#7450) @ChrisJar
Add cython for converting strings/fixed-point functions (#7429) @davidwendt
Add struct column support to cudf::sort and cudf::sorted_order (#7422) @karthikeyann
Implement groupby collect_set (#7420) @ttnghia
Merge branch-0.18 into branch-0.19 (#7411) @raydouglass
Refactor strings column factories (#7397) @harrism
Add groupby scan operations (sort groupby) (#7387) @karthikeyann
Add cudf::explode_position (#7376) @hyperbolic2346
Add string conversion to/from decimal values libcudf APIs (#7364) @davidwendt
Add groupby SUM_OF_SQUARES support (#7362) @karthikeyann
Add Series.drop api (#7304) @isVoid
get_json_object() implementation (#7286) @nvdbaranec
Python API for ListMethods.len() (#7283) @isVoid
Support null_policy::EXCLUDE for COLLECT rolling aggregation (#7264) @mythrocks
Add support for special tokens in nvtext::subword_tokenizer (#7254) @davidwendt
Fix inplace update of data and add Series.update (#7201) @galipremsagar
Implement cudf::group_by (hash) for decimal32 and decimal64 (#7190) @codereport
Adding support to specify "level" parameter for Dataframe.rename (#7135) @skirui-source

🛠️ Improvements

fix GDS include path for version 0.95 (#7877) @rongou
Update dask + distributed to 2021.4.0 (#7858) @jakirkham
Add ability to extract include dirs from CUDF_HOME (#7848) @galipremsagar
Add USE_GDS as an option in build script (#7833) @pxLi
add an allocate method with stream in java DeviceMemoryBuffer (#7826) @rongou
Constrain dask and distributed versions to 2021.3.1 (#7825) @shwina
Revert dask versioning of concat dispatch (#7823) @galipremsagar
add copy methods in Java memory buffer (#7791) @rongou
Update README and CONTRIBUTING for 0.19 (#7778) @robertmaynard
Allow hash_partition to take a seed value (#7771) @magnatelee
Turn on NVTX by default in java build (#7761) @tgravescs
Add Java bindings to join gather map APIs (#7751) @jlowe
Add replacements column support for Java replaceNulls (#7750) @jlowe
Add Java bindings for row_bit_count (#7749) @jlowe
Remove unused JVM array creation (#7748) @jlowe
Added JNI support for new is_integer (#7739) @revans2
Create and promote library aliases in libcudf installations (#7734) @trxcllnt
Support groupby operations for decimal dtypes (#7731) @vyasr
Memory map the input file only when GDS compatibility mode is not used (#7717) @vuule
Replace device_vector with device_uvector in null_mask (#7715) @harrism
Struct hashing support for SerialMurmur3 and SparkMurmur3 (#7714) @jlowe
Add gbenchmark for nvtext replace-tokens function (#7708) @davidwendt
Use stream in groupby calls (#7705) @karthikeyann
Update codeowners file (#7701) @ajschmidt8
Cleanup groupby to use host_span, device_span, device_uvector (#7698) @karthikeyann
Add gbenchmark for nvtext ngrams functions (#7693) @davidwendt
Misc Python/Cython optimizations (#7686) @shwina
Add gbenchmark for nvtext tokenize functions (#7684) @davidwendt
Add column_device_view to orc writer (#7676) @kaatish
cudf_kafka now uses cuDF CMake export targets (CPM) (#7674) @robertmaynard
Add gbenchmark for nvtext normalize functions (#7668) @davidwendt
Resolve unnecessary import of thrust/optional.hpp in types.hpp (#7667) @vyasr
Feature/optimize accessor copy (#7660) @vyasr
Fix find_package(cudf) (#7658) @trxcllnt
Work-around for gcc7 compile error on Centos7 (#7652) @davidwendt
Add in JNI support for count_elements (#7651) @revans2
Fix issues with building cudf in a non-conda environment (#7647) @galipremsagar
Refactor ConfigureCUDA to not conditionally insert compiler flags (#7643) @robertmaynard
Add gbenchmark for converting strings to/from timestamps (#7641) @davidwendt
Handle constructing a cudf.Scalar from a cudf.Scalar (#7639) @shwina
Add in JNI support for table partition (#7637) @revans2
Add explicit fixed_point merge test (#7635) @codereport
Add JNI support for IDENTITY hash partitioning (#7626) @revans2
Java support on explode_outer (#7625) @sperlingxx
Java support of casting string from/to decimal (#7623) @sperlingxx
Convert cudf::concatenate APIs to use spans and device_uvector (#7621) @harrism
Add gbenchmark for cudf::strings::translate function (#7617) @davidwendt
Use file(COPY ) over file(INSTALL ) so cmake output is reduced (#7616) @robertmaynard
Use rmm::device_uvector in place of rmm::device_vector for ORC reader/writer and cudf::io::column_buffer (#7614) @vuule
Refactor Java host-side buffer concatenation to expose separate steps (#7610) @jlowe
Add gbenchmarks for string substrings functions (#7603) @davidwendt
Refactor string conversion check (#7599) @ttnghia
JNI: Pass names of children struct columns to native Arrow IPC writer (#7598) @firestarman
Revert "ENH Fix stale GHA and prevent duplicates " (#7595) @mike-wendt
ENH Fix stale GHA and prevent duplicates (#7594) @mike-wendt
Fix auto-detecting GPU architectures (#7593) @trxcllnt
Reduce cudf library size (#7583) @robertmaynard
Optimize cudf::make_strings_column for long strings (#7576) @davidwendt
Always build and export the cudf::cudftestutil target (#7574) @trxcllnt
Eliminate literal parameters to uvector::set_element_async and device_scalar::set_value (#7563) @harrism
Add gbenchmark for strings::concatenate (#7560) @davidwendt
Update Changelog Link (#7550) @ajschmidt8
Add gbenchmarks for strings replace regex functions (#7541) @davidwendt
Add __repr__ for Column and ColumnAccessor (#7531) @shwina
Support Decimal DIV changes in cudf (#7527) @razajafri
Remove unneeded step parameter from strings::detail::copy_slice (#7525) @davidwendt
Use device_uvector, device_span in sort groupby (#7523) @karthikeyann
Add gbenchmarks for strings extract function (#7522) @davidwendt
Rename ARROW_STATIC_LIB because it conflicts with one in FindArrow.cmake (#7518) @trxcllnt
Reduce compile time/size for scan.cu (#7516) @davidwendt
Change device_vector to device_uvector in nvtext source files (#7512) @davidwendt
Removed unneeded includes from traits.hpp (#7509) @davidwendt
FIX Remove random build directory generation for ccache (#7508) @dillon-cullinan
xfail failing pytest in pandas 1.2.3 (#7507) @galipremsagar
JNI bit cast (#7493) @revans2
Combine rolling window function tests (#7480) @mythrocks
Prepare Changelog for Automation (#7477) @ajschmidt8
Java support for explode position (#7471) @sperlingxx
Update 0.18 changelog entry (#7463) @ajschmidt8
JNI: Support skipping nulls for collect aggregation (#7457) @firestarman
Join APIs that return gathermaps (#7454) @shwina
Remove dependence on managed memory for multimap test (#7451) @jrhemstad
Use cuFile for Parquet IO when available (#7444) @vuule
Statistics cleanup (#7439) @kaatish
Add gbenchmarks for strings filter functions (#7438) @davidwendt
fixed_point + cudf::binary_operation API Changes (#7435) @codereport
Improve string gather performance (#7433) @jlowe
Don't use user resource for a temporary allocation in sort_by_key (#7431) @magnatelee
Detail APIs for datetime functions (#7430) @magnatelee
Replace thrust::max_element with thrust::reduce in strings findall_re (#7428) @davidwendt
Add gbenchmark for strings split/split_record functions (#7427) @davidwendt
Update JNI build to use CMAKE_CUDA_ARCHITECTURES (#7425) @jlowe
Change nvtext::load_vocabulary_file to return a unique ptr (#7424) @davidwendt
Simplify type dispatch with device_storage_dispatch (#7419) @codereport
Java support for casting of nested child columns (#7417) @razajafri
Improve scalar string replace performance for long strings (#7415) @jlowe
Remove unneeded temporary device vector for strings scatter specialization (#7409) @davidwendt
bitmask_or implementation with bitmask refactor (#7406) @rwlee
Add other cudf::strings::replace functions to current strings replace gbenchmark (#7403) @davidwendt
Clean up included headers in device_operators.cuh (#7401) @codereport
Move nullable index iterator to indexalator factory (#7399) @davidwendt
ENH Pass ccache variables to conda recipe & use Ninja in CI (#7398) @Ethyling
upgrade maven-antrun-plugin to support maven parallel builds (#7393) @rongou
Add gbenchmark for strings find/contains functions (#7392) @davidwendt
Use CMAKE_CUDA_ARCHITECTURES (#7391) @robertmaynard
Refactor libcudf strings::replace to use make_strings_children utility (#7384) @davidwendt
Added in JNI support for out of core sort algorithm (#7381) @revans2
Upgrade pandas to 1.2 (#7375) @galipremsagar
Rename logical_cast to bit_cast and allow additional conversions (#7373) @ttnghia
jitify 2 support (#7372) @cwharris
compile_udf: Cache PTX for similar functions (#7371) @gmarkall
Add string scalar replace benchmark (#7369) @jlowe
Add gbenchmark for strings contains_re/count_re functions (#7366) @davidwendt
Update orc reader and writer fuzz tests (#7357) @galipremsagar
Improve url_decode performance for long strings (#7353) @jlowe
cudf::ast Small Refactorings (#7352) @codereport
Remove std::cout and print in the scatter test function EmptyListsOfNullableStrings. (#7342) @ttnghia
Use cudf::detail::make_counting_transform_iterator (#7338) @codereport
Change block size parameter from a global to a template param. (#7333) @nvdbaranec
Partial clean up of ORC writer (#7324) @vuule
Add gbenchmark for cudf::strings::to_lower (#7316) @davidwendt
Update Java bindings version to 0.19-SNAPSHOT (#7307) @pxLi
Move cudf::test::make_counting_transform_iterator to cudf/detail/iterator.cuh (#7306) @codereport
Use string literals in fixed_point release_asserts (#7303) @codereport
Fix merge conflicts for #7295 (#7297) @ajschmidt8
Add UTF-8 chars to create_random_column<string_view> benchmark utility (#7292) @davidwendt
Abstracting block reduce and block scan from cuIO kernels with cub apis (#7278) @rgsl888prabhu
Build.sh use cmake --build to drive build system invocation (#7270) @robertmaynard
Refactor dictionary support for reductions any/all (#7242) @davidwendt
Replace stream.value() with stream for stream_view args (#7236) @karthikeyann
Interval index and interval_range (#7182) @marlenezw
avro reader integration tests (#7156) @cwharris
Rework libcudf CMakeLists.txt to export targets for CPM (#7107) @trxcllnt
Adding Interval Dtype (#6984) @marlenezw
Cleaning up for loops with make_(counting_)transform_iterator (#6546) @codereport

- C++
Published by GPUtester about 5 years ago

https://github.com/rapidsai/cudf - v0.19.1

🚨 Breaking Changes

Allow hash_partition to take a seed value (#7771) @magnatelee
Allow merging index column with data column using keyword "on" (#7736) @skirui-source
Change JNI API to avoid loading native dependencies when creating sort order classes. (#7729) @revans2
Replace device_vector with device_uvector in null_mask (#7715) @harrism
Don't identify decimals as strings. (#7710) @vyasr
Fix Java Parquet write after writer API changes (#7655) @revans2
Convert cudf::concatenate APIs to use spans and device_uvector (#7621) @harrism
Update missing docstring examples in python public APIs (#7546) @galipremsagar
Remove unneeded step parameter from strings::detail::copy_slice (#7525) @davidwendt
Rename ARROW_STATIC_LIB because it conflicts with one in FindArrow.cmake (#7518) @trxcllnt
Match Pandas logic for comparing two objects with nulls (#7490) @brandon-b-miller
Add struct support to parquet writer (#7461) @devavret
Join APIs that return gathermaps (#7454) @shwina
fixed_point + cudf::binary_operation API Changes (#7435) @codereport
Fix BUG: Exception when PYTHONOPTIMIZE=2 (#7434) @skirui-source
Change nvtext::load_vocabulary_file to return a unique ptr (#7424) @davidwendt
Refactor strings column factories (#7397) @harrism
Use CMAKE_CUDA_ARCHITECTURES (#7391) @robertmaynard
Upgrade pandas to 1.2 (#7375) @galipremsagar
Rename logical_cast to bit_cast and allow additional conversions (#7373) @ttnghia
Rework libcudf CMakeLists.txt to export targets for CPM (#7107) @trxcllnt

🐛 Bug Fixes

Fix returned column type when extracting from an empty list column (#8031) @jlowe
Don't reindex a new value on setitem if the original dataframe was empty (#8026) @vyasr
Fix a NameError in meta dispatch API (#7996) @galipremsagar
Reindex in DataFrame.__setitem__ (#7957) @galipremsagar
jitify direct-to-cubin compilation and caching. (#7919) @cwharris
Use dynamic cudart for nvcomp in java build (#7896) @abellina
fix "incompatible redefinition" warnings (#7894) @cwharris
cudf consistently specifies the cuda runtime (#7887) @robertmaynard
disable verbose output for jitify_preprocess (#7886) @cwharris
CMake jit_preprocess_files function only runs when needed (#7872) @robertmaynard
Push DeviceScalar construction into cython for list.contains (#7864) @brandon-b-miller
cudf now sets an install rpath of $ORIGIN (#7863) @robertmaynard
Don't install Thrust examples, tests, docs, and python files (#7811) @robertmaynard
Sort by index in groupby tests more consistently (#7802) @shwina
Revert "Update conda recipes pinning of repo dependencies (#7743)" (#7793) @raydouglass
Add decimal column handling in copy_type_metadata (#7788) @shwina
Add column names validation in parquet writer (#7786) @galipremsagar
Fix Java explode outer unit tests (#7782) @jlowe
Fix compiler warning about non-POD types passed through ellipsis (#7781) @jrhemstad
User resource fix for replace_nulls (#7769) @magnatelee
Fix type dispatch for columnar replace_nulls (#7768) @jlowe
Add ignore_order parameter to dask-cudf concat dispatch (#7765) @galipremsagar
Fix slicing and arrow representations of decimal columns (#7755) @vyasr
Fixing issue with explode_outer position not nulling position entries of null rows (#7754) @hyperbolic2346
Implement scatter for struct columns (#7752) @ttnghia
Fix data corruption in string columns (#7746) @galipremsagar
Fix string length in stripe dictionary building (#7744) @kaatish
Update conda recipes pinning of repo dependencies (#7743) @mike-wendt
Enable dask dispatch to cuDF's is_categorical_dtype for cuDF objects (#7740) @brandon-b-miller
Fix dictionary size computation in ORC writer (#7737) @vuule
Fix cudf::cast overflow for decimal64 to int32_t or smaller in certain cases (#7733) @codereport
Change JNI API to avoid loading native dependencies when creating sort order classes. (#7729) @revans2
Disable column_view data accessors for unsupported types (#7725) @jrhemstad
Materialize RangeIndex when index=True in parquet writer (#7711) @galipremsagar
Don't identify decimals as strings. (#7710) @vyasr
Fix return type of DataFrame.argsort (#7706) @galipremsagar
Fix/correct cudf installed package requirements (#7688) @robertmaynard
Fix SparkMurmurHash3_32 hash inconsistencies with Apache Spark (#7672) @jlowe
Fix ORC reader issue with reading empty string columns (#7656) @rgsl888prabhu
Fix Java Parquet write after writer API changes (#7655) @revans2
Fixing empty null lists throwing explode_outer for a loop. (#7649) @hyperbolic2346
Fix internal compiler error during JNI Docker build (#7645) @jlowe
Fix Debug build break with device_uvectors in grouped_rolling.cu (#7633) @mythrocks
Parquet reader: Fix issue when using skip_rows on non-nested columns containing nulls (#7627) @nvdbaranec
Fix ORC reader for empty DataFrame/Table (#7624) @rgsl888prabhu
Fix specifying GPU architecture in JNI build (#7612) @jlowe
Fix ORC writer OOM issue (#7605) @vuule
Fix 0.18 --> 0.19 automerge (#7589) @kkraus14
Fix ORC issue with incorrect timestamp nanosecond values (#7581) @vuule
Fix missing Dask imports (#7580) @kkraus14
CMAKE_CUDA_ARCHITECTURES doesn't change when build-system invokes cmake (#7579) @robertmaynard
Another fix for offsets_end() iterator in lists_column_view (#7575) @ttnghia
Fix ORC writer output corruption with string columns (#7565) @vuule
Fix cudf::lists::sort_lists failing for sliced column (#7564) @ttnghia
FIX Fix Anaconda upload args (#7558) @dillon-cullinan
Fix index mismatch issue in equality related APIs (#7555) @galipremsagar
FIX Revert gpuci_conda_retry on conda file output locations (#7552) @dillon-cullinan
Fix offset_end iterator for lists_column_view, which was not correctl… (#7551) @ttnghia
Fix no such file dlpack.h error when build libcudf (#7549) @chenrui17
Update missing docstring examples in python public APIs (#7546) @galipremsagar
Decimal32 Build Fix (#7544) @razajafri
FIX Retry conda output location (#7540) @dillon-cullinan
fix missing renames of dask git branches from master to main (#7535) @kkraus14
Remove detail from device_span (#7533) @rwlee
Change dask and distributed branch to main (#7532) @dantegd
Update JNI build to use CUDF_USE_ARROW_STATIC (#7526) @jlowe
Make sure rmm::rmm CMake target is visible to cudf users (#7524) @robertmaynard
Fix contiguous_split not properly handling output partitions > 2 GB. (#7515) @nvdbaranec
Change jit launch to safe_launch (#7510) @devavret
Fix comparison between Datetime/Timedelta columns and NULL scalars (#7504) @brandon-b-miller
Fix off-by-one error in char-parallel string scalar replace (#7502) @jlowe
Fix JNI deprecation of all, put it on the wrong version before (#7501) @revans2
Fix Series/Dataframe Mixed Arithmetic (#7491) @brandon-b-miller
Fix JNI build after removal of libcudf sub-libraries (#7486) @jlowe
Correctly compile benchmarks (#7485) @robertmaynard
Fix bool column corruption with ORC Reader (#7483) @rgsl888prabhu
Fix __repr__ for categorical dtype (#7476) @galipremsagar
Java cleaner synchronization (#7474) @abellina
Fix java float/double parsing tests (#7473) @revans2
Pass stream and user resource to make_default_constructed_scalar (#7469) @magnatelee
Improve stability of dask_cudf.DataFrame.var and dask_cudf.DataFrame.std (#7453) @rjzamora
Missing device_storage_dispatch change affecting cudf::gather (#7449) @codereport
fix cuFile JNI compile errors (#7445) @rongou
Support Series.__setitem__ with key to a new row (#7443) @isVoid
Fix BUG: Exception when PYTHONOPTIMIZE=2 (#7434) @skirui-source
Make inclusive scan safe for cases with leading nulls (#7432) @magnatelee
Fix typo in list_device_view::pair_rep_end() (#7423) @mythrocks
Fix string to double conversion and row equivalent comparison (#7410) @ttnghia
Fix thrust failure when transferring data from device_vector to host_vector with vectors of size 1 (#7382) @ttnghia
Fix std::exception catch-by-reference gcc9 compile error (#7380) @davidwendt
Fix skiprows issue with ORC Reader (#7359) @rgsl888prabhu
fix Arrow CMake file (#7358) @rongou
Fix lists::contains() for NaN and Decimals (#7349) @mythrocks
Handle cupy array in Dataframe.__setitem__ (#7340) @galipremsagar
Fix invalid-device-fn error in cudf::strings::replace_re with multiple regex's (#7336) @davidwendt
FIX Add codecov upload block to gpu script (#6860) @dillon-cullinan

📖 Documentation

Fix join API doxygen (#7890) @shwina
Add Resources to README. (#7697) @bdice
Add isin examples in Docstring (#7479) @galipremsagar
Resolving unlinked type shorthands in cudf doc (#7416) @isVoid
Fix typo in regex.md doc page (#7363) @davidwendt
Fix incorrect strings_column_view::chars_size documentation (#7360) @jlowe

🚀 New Features

Enable basic reductions for decimal columns (#7776) @ChrisJar
Enable join on decimal columns (#7764) @ChrisJar
Allow merging index column with data column using keyword "on" (#7736) @skirui-source
Implement DecimalColumn + Scalar and add cudf.Scalars of Decimal64Dtype (#7732) @brandon-b-miller
Add support for unique groupby aggregation (#7726) @shwina
Expose libcudf's label_bins function to cudf (#7724) @vyasr
Adding support for equi-join on struct (#7720) @hyperbolic2346
Add decimal column comparison operations (#7716) @isVoid
Implement scan operations for decimal columns (#7707) @ChrisJar
Enable typecasting between decimal and int (#7691) @ChrisJar
Enable decimal support in parquet writer (#7673) @devavret
Adds list.unique API (#7664) @isVoid
Fix NaN handling in drop_list_duplicates (#7662) @ttnghia
Add lists.sort_values API (#7657) @isVoid
Add is_integer API that can check for the validity of a string-to-integer conversion (#7642) @ttnghia
Adds explode API (#7607) @isVoid
Adds list.take, python binding for cudf::lists::segmented_gather (#7591) @isVoid
Implement cudf::label_bins() (#7554) @vyasr
Add Python bindings for lists::contains (#7547) @skirui-source
cudf::row_bit_count() support. (#7534) @nvdbaranec
Implement drop_list_duplicates (#7528) @ttnghia
Add Python bindings for lists::extract_lists_element (#7505) @skirui-source
Add explode_outer and explode_outer_position (#7499) @hyperbolic2346
Match Pandas logic for comparing two objects with nulls (#7490) @brandon-b-miller
Add struct support to parquet writer (#7461) @devavret
Enable type conversion from float to decimal type (#7450) @ChrisJar
Add cython for converting strings/fixed-point functions (#7429) @davidwendt
Add struct column support to cudf::sort and cudf::sorted_order (#7422) @karthikeyann
Implement groupby collect_set (#7420) @ttnghia
Merge branch-0.18 into branch-0.19 (#7411) @raydouglass
Refactor strings column factories (#7397) @harrism
Add groupby scan operations (sort groupby) (#7387) @karthikeyann
Add cudf::explode_position (#7376) @hyperbolic2346
Add string conversion to/from decimal values libcudf APIs (#7364) @davidwendt
Add groupby SUM_OF_SQUARES support (#7362) @karthikeyann
Add Series.drop api (#7304) @isVoid
get_json_object() implementation (#7286) @nvdbaranec
Python API for ListMethods.len() (#7283) @isVoid
Support null_policy::EXCLUDE for COLLECT rolling aggregation (#7264) @mythrocks
Add support for special tokens in nvtext::subword_tokenizer (#7254) @davidwendt
Fix inplace update of data and add Series.update (#7201) @galipremsagar
Implement cudf::group_by (hash) for decimal32 and decimal64 (#7190) @codereport
Adding support to specify "level" parameter for Dataframe.rename (#7135) @skirui-source

🛠️ Improvements

fix GDS include path for version 0.95 (#7877) @rongou
Update dask + distributed to 2021.4.0 (#7858) @jakirkham
Add ability to extract include dirs from CUDF_HOME (#7848) @galipremsagar
Add USE_GDS as an option in build script (#7833) @pxLi
add an allocate method with stream in java DeviceMemoryBuffer (#7826) @rongou
Constrain dask and distributed versions to 2021.3.1 (#7825) @shwina
Revert dask versioning of concat dispatch (#7823) @galipremsagar
add copy methods in Java memory buffer (#7791) @rongou
Update README and CONTRIBUTING for 0.19 (#7778) @robertmaynard
Allow hash_partition to take a seed value (#7771) @magnatelee
Turn on NVTX by default in java build (#7761) @tgravescs
Add Java bindings to join gather map APIs (#7751) @jlowe
Add replacements column support for Java replaceNulls (#7750) @jlowe
Add Java bindings for row_bit_count (#7749) @jlowe
Remove unused JVM array creation (#7748) @jlowe
Added JNI support for new is_integer (#7739) @revans2
Create and promote library aliases in libcudf installations (#7734) @trxcllnt
Support groupby operations for decimal dtypes (#7731) @vyasr
Memory map the input file only when GDS compatibility mode is not used (#7717) @vuule
Replace device_vector with device_uvector in null_mask (#7715) @harrism
Struct hashing support for SerialMurmur3 and SparkMurmur3 (#7714) @jlowe
Add gbenchmark for nvtext replace-tokens function (#7708) @davidwendt
Use stream in groupby calls (#7705) @karthikeyann
Update codeowners file (#7701) @ajschmidt8
Cleanup groupby to use host_span, device_span, device_uvector (#7698) @karthikeyann
Add gbenchmark for nvtext ngrams functions (#7693) @davidwendt
Misc Python/Cython optimizations (#7686) @shwina
Add gbenchmark for nvtext tokenize functions (#7684) @davidwendt
Add column_device_view to orc writer (#7676) @kaatish
cudf_kafka now uses cuDF CMake export targets (CPM) (#7674) @robertmaynard
Add gbenchmark for nvtext normalize functions (#7668) @davidwendt
Resolve unnecessary import of thrust/optional.hpp in types.hpp (#7667) @vyasr
Feature/optimize accessor copy (#7660) @vyasr
Fix find_package(cudf) (#7658) @trxcllnt
Work-around for gcc7 compile error on Centos7 (#7652) @davidwendt
Add in JNI support for count_elements (#7651) @revans2
Fix issues with building cudf in a non-conda environment (#7647) @galipremsagar
Refactor ConfigureCUDA to not conditionally insert compiler flags (#7643) @robertmaynard
Add gbenchmark for converting strings to/from timestamps (#7641) @davidwendt
Handle constructing a cudf.Scalar from a cudf.Scalar (#7639) @shwina
Add in JNI support for table partition (#7637) @revans2
Add explicit fixed_point merge test (#7635) @codereport
Add JNI support for IDENTITY hash partitioning (#7626) @revans2
Java support on explode_outer (#7625) @sperlingxx
Java support of casting string from/to decimal (#7623) @sperlingxx
Convert cudf::concatenate APIs to use spans and device_uvector (#7621) @harrism
Add gbenchmark for cudf::strings::translate function (#7617) @davidwendt
Use file(COPY ) over file(INSTALL ) so cmake output is reduced (#7616) @robertmaynard
Use rmm::device_uvector in place of rmm::device_vector for ORC reader/writer and cudf::io::column_buffer (#7614) @vuule
Refactor Java host-side buffer concatenation to expose separate steps (#7610) @jlowe
Add gbenchmarks for string substrings functions (#7603) @davidwendt
Refactor string conversion check (#7599) @ttnghia
JNI: Pass names of children struct columns to native Arrow IPC writer (#7598) @firestarman
Revert "ENH Fix stale GHA and prevent duplicates " (#7595) @mike-wendt
ENH Fix stale GHA and prevent duplicates (#7594) @mike-wendt
Fix auto-detecting GPU architectures (#7593) @trxcllnt
Reduce cudf library size (#7583) @robertmaynard
Optimize cudf::make_strings_column for long strings (#7576) @davidwendt
Always build and export the cudf::cudftestutil target (#7574) @trxcllnt
Eliminate literal parameters to uvector::set_element_async and device_scalar::set_value (#7563) @harrism
Add gbenchmark for strings::concatenate (#7560) @davidwendt
Update Changelog Link (#7550) @ajschmidt8
Add gbenchmarks for strings replace regex functions (#7541) @davidwendt
Add __repr__ for Column and ColumnAccessor (#7531) @shwina
Support Decimal DIV changes in cudf (#7527) @razajafri
Remove unneeded step parameter from strings::detail::copy_slice (#7525) @davidwendt
Use device_uvector, device_span in sort groupby (#7523) @karthikeyann
Add gbenchmarks for strings extract function (#7522) @davidwendt
Rename ARROW_STATIC_LIB because it conflicts with one in FindArrow.cmake (#7518) @trxcllnt
Reduce compile time/size for scan.cu (#7516) @davidwendt
Change device_vector to device_uvector in nvtext source files (#7512) @davidwendt
Removed unneeded includes from traits.hpp (#7509) @davidwendt
FIX Remove random build directory generation for ccache (#7508) @dillon-cullinan
xfail failing pytest in pandas 1.2.3 (#7507) @galipremsagar
JNI bit cast (#7493) @revans2
Combine rolling window function tests (#7480) @mythrocks
Prepare Changelog for Automation (#7477) @ajschmidt8
Java support for explode position (#7471) @sperlingxx
Update 0.18 changelog entry (#7463) @ajschmidt8
JNI: Support skipping nulls for collect aggregation (#7457) @firestarman
Join APIs that return gathermaps (#7454) @shwina
Remove dependence on managed memory for multimap test (#7451) @jrhemstad
Use cuFile for Parquet IO when available (#7444) @vuule
Statistics cleanup (#7439) @kaatish
Add gbenchmarks for strings filter functions (#7438) @davidwendt
fixed_point + cudf::binary_operation API Changes (#7435) @codereport
Improve string gather performance (#7433) @jlowe
Don't use user resource for a temporary allocation in sort_by_key (#7431) @magnatelee
Detail APIs for datetime functions (#7430) @magnatelee
Replace thrust::max_element with thrust::reduce in strings findall_re (#7428) @davidwendt
Add gbenchmark for strings split/split_record functions (#7427) @davidwendt
Update JNI build to use CMAKE_CUDA_ARCHITECTURES (#7425) @jlowe
Change nvtext::load_vocabulary_file to return a unique ptr (#7424) @davidwendt
Simplify type dispatch with device_storage_dispatch (#7419) @codereport
Java support for casting of nested child columns (#7417) @razajafri
Improve scalar string replace performance for long strings (#7415) @jlowe
Remove unneeded temporary device vector for strings scatter specialization (#7409) @davidwendt
bitmask_or implementation with bitmask refactor (#7406) @rwlee
Add other cudf::strings::replace functions to current strings replace gbenchmark (#7403) @davidwendt
Clean up included headers in device_operators.cuh (#7401) @codereport
Move nullable index iterator to indexalator factory (#7399) @davidwendt
ENH Pass ccache variables to conda recipe & use Ninja in CI (#7398) @Ethyling
upgrade maven-antrun-plugin to support maven parallel builds (#7393) @rongou
Add gbenchmark for strings find/contains functions (#7392) @davidwendt
Use CMAKE_CUDA_ARCHITECTURES (#7391) @robertmaynard
Refactor libcudf strings::replace to use make_strings_children utility (#7384) @davidwendt
Added in JNI support for out of core sort algorithm (#7381) @revans2
Upgrade pandas to 1.2 (#7375) @galipremsagar
Rename logical_cast to bit_cast and allow additional conversions (#7373) @ttnghia
jitify 2 support (#7372) @cwharris
compile_udf: Cache PTX for similar functions (#7371) @gmarkall
Add string scalar replace benchmark (#7369) @jlowe
Add gbenchmark for strings contains_re/count_re functions (#7366) @davidwendt
Update orc reader and writer fuzz tests (#7357) @galipremsagar
Improve url_decode performance for long strings (#7353) @jlowe
cudf::ast Small Refactorings (#7352) @codereport
Remove std::cout and print in the scatter test function EmptyListsOfNullableStrings. (#7342) @ttnghia
Use cudf::detail::make_counting_transform_iterator (#7338) @codereport
Change block size parameter from a global to a template param. (#7333) @nvdbaranec
Partial clean up of ORC writer (#7324) @vuule
Add gbenchmark for cudf::strings::to_lower (#7316) @davidwendt
Update Java bindings version to 0.19-SNAPSHOT (#7307) @pxLi
Move cudf::test::make_counting_transform_iterator to cudf/detail/iterator.cuh (#7306) @codereport
Use string literals in fixed_point release_asserts (#7303) @codereport
Fix merge conflicts for #7295 (#7297) @ajschmidt8
Add UTF-8 chars to create_random_column<string_view> benchmark utility (#7292) @davidwendt
Abstracting block reduce and block scan from cuIO kernels with cub apis (#7278) @rgsl888prabhu
Build.sh use cmake --build to drive build system invocation (#7270) @robertmaynard
Refactor dictionary support for reductions any/all (#7242) @davidwendt
Replace stream.value() with stream for stream_view args (#7236) @karthikeyann
Interval index and interval_range (#7182) @marlenezw
avro reader integration tests (#7156) @cwharris
Rework libcudf CMakeLists.txt to export targets for CPM (#7107) @trxcllnt
Adding Interval Dtype (#6984) @marlenezw
Cleaning up for loops with make_(counting_)transform_iterator (#6546) @codereport

- C++
Published by GPUtester about 5 years ago

https://github.com/rapidsai/cudf - v0.19.0

🚨 Breaking Changes

Allow hash_partition to take a seed value (#7771) @magnatelee
Allow merging index column with data column using keyword "on" (#7736) @skirui-source
Change JNI API to avoid loading native dependencies when creating sort order classes. (#7729) @revans2
Replace device_vector with device_uvector in null_mask (#7715) @harrism
Don't identify decimals as strings. (#7710) @vyasr
Fix Java Parquet write after writer API changes (#7655) @revans2
Convert cudf::concatenate APIs to use spans and device_uvector (#7621) @harrism
Update missing docstring examples in python public APIs (#7546) @galipremsagar
Remove unneeded step parameter from strings::detail::copy_slice (#7525) @davidwendt
Rename ARROW_STATIC_LIB because it conflicts with one in FindArrow.cmake (#7518) @trxcllnt
Match Pandas logic for comparing two objects with nulls (#7490) @brandon-b-miller
Add struct support to parquet writer (#7461) @devavret
Join APIs that return gathermaps (#7454) @shwina
fixed_point + cudf::binary_operation API Changes (#7435) @codereport
Fix BUG: Exception when PYTHONOPTIMIZE=2 (#7434) @skirui-source
Change nvtext::load_vocabulary_file to return a unique ptr (#7424) @davidwendt
Refactor strings column factories (#7397) @harrism
Use CMAKE_CUDA_ARCHITECTURES (#7391) @robertmaynard
Upgrade pandas to 1.2 (#7375) @galipremsagar
Rename logical_cast to bit_cast and allow additional conversions (#7373) @ttnghia
Rework libcudf CMakeLists.txt to export targets for CPM (#7107) @trxcllnt

🐛 Bug Fixes

Fix a NameError in meta dispatch API (#7996) @galipremsagar
Reindex in DataFrame.__setitem__ (#7957) @galipremsagar
jitify direct-to-cubin compilation and caching. (#7919) @cwharris
Use dynamic cudart for nvcomp in java build (#7896) @abellina
fix "incompatible redefinition" warnings (#7894) @cwharris
cudf consistently specifies the cuda runtime (#7887) @robertmaynard
disable verbose output for jitify_preprocess (#7886) @cwharris
CMake jit_preprocess_files function only runs when needed (#7872) @robertmaynard
Push DeviceScalar construction into cython for list.contains (#7864) @brandon-b-miller
cudf now sets an install rpath of $ORIGIN (#7863) @robertmaynard
Don't install Thrust examples, tests, docs, and python files (#7811) @robertmaynard
Sort by index in groupby tests more consistently (#7802) @shwina
Revert "Update conda recipes pinning of repo dependencies (#7743)" (#7793) @raydouglass
Add decimal column handling in copy_type_metadata (#7788) @shwina
Add column names validation in parquet writer (#7786) @galipremsagar
Fix Java explode outer unit tests (#7782) @jlowe
Fix compiler warning about non-POD types passed through ellipsis (#7781) @jrhemstad
User resource fix for replace_nulls (#7769) @magnatelee
Fix type dispatch for columnar replace_nulls (#7768) @jlowe
Add ignore_order parameter to dask-cudf concat dispatch (#7765) @galipremsagar
Fix slicing and arrow representations of decimal columns (#7755) @vyasr
Fixing issue with explode_outer position not nulling position entries of null rows (#7754) @hyperbolic2346
Implement scatter for struct columns (#7752) @ttnghia
Fix data corruption in string columns (#7746) @galipremsagar
Fix string length in stripe dictionary building (#7744) @kaatish
Update conda recipes pinning of repo dependencies (#7743) @mike-wendt
Enable dask dispatch to cuDF's is_categorical_dtype for cuDF objects (#7740) @brandon-b-miller
Fix dictionary size computation in ORC writer (#7737) @vuule
Fix cudf::cast overflow for decimal64 to int32_t or smaller in certain cases (#7733) @codereport
Change JNI API to avoid loading native dependencies when creating sort order classes. (#7729) @revans2
Disable column_view data accessors for unsupported types (#7725) @jrhemstad
Materialize RangeIndex when index=True in parquet writer (#7711) @galipremsagar
Don't identify decimals as strings. (#7710) @vyasr
Fix return type of DataFrame.argsort (#7706) @galipremsagar
Fix/correct cudf installed package requirements (#7688) @robertmaynard
Fix SparkMurmurHash3_32 hash inconsistencies with Apache Spark (#7672) @jlowe
Fix ORC reader issue with reading empty string columns (#7656) @rgsl888prabhu
Fix Java Parquet write after writer API changes (#7655) @revans2
Fixing empty null lists throwing explode_outer for a loop. (#7649) @hyperbolic2346
Fix internal compiler error during JNI Docker build (#7645) @jlowe
Fix Debug build break with device_uvectors in grouped_rolling.cu (#7633) @mythrocks
Parquet reader: Fix issue when using skip_rows on non-nested columns containing nulls (#7627) @nvdbaranec
Fix ORC reader for empty DataFrame/Table (#7624) @rgsl888prabhu
Fix specifying GPU architecture in JNI build (#7612) @jlowe
Fix ORC writer OOM issue (#7605) @vuule
Fix 0.18 --> 0.19 automerge (#7589) @kkraus14
Fix ORC issue with incorrect timestamp nanosecond values (#7581) @vuule
Fix missing Dask imports (#7580) @kkraus14
CMAKE_CUDA_ARCHITECTURES doesn't change when build-system invokes cmake (#7579) @robertmaynard
Another fix for offsets_end() iterator in lists_column_view (#7575) @ttnghia
Fix ORC writer output corruption with string columns (#7565) @vuule
Fix cudf::lists::sort_lists failing for sliced column (#7564) @ttnghia
FIX Fix Anaconda upload args (#7558) @dillon-cullinan
Fix index mismatch issue in equality related APIs (#7555) @galipremsagar
FIX Revert gpuci_conda_retry on conda file output locations (#7552) @dillon-cullinan
Fix offset_end iterator for lists_column_view, which was not correctl… (#7551) @ttnghia
Fix no such file dlpack.h error when build libcudf (#7549) @chenrui17
Update missing docstring examples in python public APIs (#7546) @galipremsagar
Decimal32 Build Fix (#7544) @razajafri
FIX Retry conda output location (#7540) @dillon-cullinan
fix missing renames of dask git branches from master to main (#7535) @kkraus14
Remove detail from device_span (#7533) @rwlee
Change dask and distributed branch to main (#7532) @dantegd
Update JNI build to use CUDF_USE_ARROW_STATIC (#7526) @jlowe
Make sure rmm::rmm CMake target is visible to cudf users (#7524) @robertmaynard
Fix contiguous_split not properly handling output partitions > 2 GB. (#7515) @nvdbaranec
Change jit launch to safe_launch (#7510) @devavret
Fix comparison between Datetime/Timedelta columns and NULL scalars (#7504) @brandon-b-miller
Fix off-by-one error in char-parallel string scalar replace (#7502) @jlowe
Fix JNI deprecation of all, put it on the wrong version before (#7501) @revans2
Fix Series/Dataframe Mixed Arithmetic (#7491) @brandon-b-miller
Fix JNI build after removal of libcudf sub-libraries (#7486) @jlowe
Correctly compile benchmarks (#7485) @robertmaynard
Fix bool column corruption with ORC Reader (#7483) @rgsl888prabhu
Fix __repr__ for categorical dtype (#7476) @galipremsagar
Java cleaner synchronization (#7474) @abellina
Fix java float/double parsing tests (#7473) @revans2
Pass stream and user resource to make_default_constructed_scalar (#7469) @magnatelee
Improve stability of dask_cudf.DataFrame.var and dask_cudf.DataFrame.std (#7453) @rjzamora
Missing device_storage_dispatch change affecting cudf::gather (#7449) @codereport
fix cuFile JNI compile errors (#7445) @rongou
Support Series.__setitem__ with key to a new row (#7443) @isVoid
Fix BUG: Exception when PYTHONOPTIMIZE=2 (#7434) @skirui-source
Make inclusive scan safe for cases with leading nulls (#7432) @magnatelee
Fix typo in list_device_view::pair_rep_end() (#7423) @mythrocks
Fix string to double conversion and row equivalent comparison (#7410) @ttnghia
Fix thrust failure when transferring data from device_vector to host_vector with vectors of size 1 (#7382) @ttnghia
Fix std::exception catch-by-reference gcc9 compile error (#7380) @davidwendt
Fix skiprows issue with ORC Reader (#7359) @rgsl888prabhu
fix Arrow CMake file (#7358) @rongou
Fix lists::contains() for NaN and Decimals (#7349) @mythrocks
Handle cupy array in Dataframe.__setitem__ (#7340) @galipremsagar
Fix invalid-device-fn error in cudf::strings::replace_re with multiple regex's (#7336) @davidwendt
FIX Add codecov upload block to gpu script (#6860) @dillon-cullinan

📖 Documentation

Fix join API doxygen (#7890) @shwina
Add Resources to README. (#7697) @bdice
Add isin examples in Docstring (#7479) @galipremsagar
Resolving unlinked type shorthands in cudf doc (#7416) @isVoid
Fix typo in regex.md doc page (#7363) @davidwendt
Fix incorrect strings_column_view::chars_size documentation (#7360) @jlowe

🚀 New Features

Enable basic reductions for decimal columns (#7776) @ChrisJar
Enable join on decimal columns (#7764) @ChrisJar
Allow merging index column with data column using keyword "on" (#7736) @skirui-source
Implement DecimalColumn + Scalar and add cudf.Scalars of Decimal64Dtype (#7732) @brandon-b-miller
Add support for unique groupby aggregation (#7726) @shwina
Expose libcudf's label_bins function to cudf (#7724) @vyasr
Adding support for equi-join on struct (#7720) @hyperbolic2346
Add decimal column comparison operations (#7716) @isVoid
Implement scan operations for decimal columns (#7707) @ChrisJar
Enable typecasting between decimal and int (#7691) @ChrisJar
Enable decimal support in parquet writer (#7673) @devavret
Adds list.unique API (#7664) @isVoid
Fix NaN handling in drop_list_duplicates (#7662) @ttnghia
Add lists.sort_values API (#7657) @isVoid
Add is_integer API that can check for the validity of a string-to-integer conversion (#7642) @ttnghia
Adds explode API (#7607) @isVoid
Adds list.take, python binding for cudf::lists::segmented_gather (#7591) @isVoid
Implement cudf::label_bins() (#7554) @vyasr
Add Python bindings for lists::contains (#7547) @skirui-source
cudf::row_bit_count() support. (#7534) @nvdbaranec
Implement drop_list_duplicates (#7528) @ttnghia
Add Python bindings for lists::extract_lists_element (#7505) @skirui-source
Add explode_outer and explode_outer_position (#7499) @hyperbolic2346
Match Pandas logic for comparing two objects with nulls (#7490) @brandon-b-miller
Add struct support to parquet writer (#7461) @devavret
Enable type conversion from float to decimal type (#7450) @ChrisJar
Add cython for converting strings/fixed-point functions (#7429) @davidwendt
Add struct column support to cudf::sort and cudf::sorted_order (#7422) @karthikeyann
Implement groupby collect_set (#7420) @ttnghia
Merge branch-0.18 into branch-0.19 (#7411) @raydouglass
Refactor strings column factories (#7397) @harrism
Add groupby scan operations (sort groupby) (#7387) @karthikeyann
Add cudf::explode_position (#7376) @hyperbolic2346
Add string conversion to/from decimal values libcudf APIs (#7364) @davidwendt
Add groupby SUM_OF_SQUARES support (#7362) @karthikeyann
Add Series.drop api (#7304) @isVoid
get_json_object() implementation (#7286) @nvdbaranec
Python API for ListMethods.len() (#7283) @isVoid
Support null_policy::EXCLUDE for COLLECT rolling aggregation (#7264) @mythrocks
Add support for special tokens in nvtext::subword_tokenizer (#7254) @davidwendt
Fix inplace update of data and add Series.update (#7201) @galipremsagar
Implement cudf::group_by (hash) for decimal32 and decimal64 (#7190) @codereport
Adding support to specify "level" parameter for Dataframe.rename (#7135) @skirui-source

🛠️ Improvements

fix GDS include path for version 0.95 (#7877) @rongou
Update dask + distributed to 2021.4.0 (#7858) @jakirkham
Add ability to extract include dirs from CUDF_HOME (#7848) @galipremsagar
Add USE_GDS as an option in build script (#7833) @pxLi
add an allocate method with stream in java DeviceMemoryBuffer (#7826) @rongou
Constrain dask and distributed versions to 2021.3.1 (#7825) @shwina
Revert dask versioning of concat dispatch (#7823) @galipremsagar
add copy methods in Java memory buffer (#7791) @rongou
Update README and CONTRIBUTING for 0.19 (#7778) @robertmaynard
Allow hash_partition to take a seed value (#7771) @magnatelee
Turn on NVTX by default in java build (#7761) @tgravescs
Add Java bindings to join gather map APIs (#7751) @jlowe
Add replacements column support for Java replaceNulls (#7750) @jlowe
Add Java bindings for row_bit_count (#7749) @jlowe
Remove unused JVM array creation (#7748) @jlowe
Added JNI support for new is_integer (#7739) @revans2
Create and promote library aliases in libcudf installations (#7734) @trxcllnt
Support groupby operations for decimal dtypes (#7731) @vyasr
Memory map the input file only when GDS compatibility mode is not used (#7717) @vuule
Replace device_vector with device_uvector in null_mask (#7715) @harrism
Struct hashing support for SerialMurmur3 and SparkMurmur3 (#7714) @jlowe
Add gbenchmark for nvtext replace-tokens function (#7708) @davidwendt
Use stream in groupby calls (#7705) @karthikeyann
Update codeowners file (#7701) @ajschmidt8
Cleanup groupby to use host_span, device_span, device_uvector (#7698) @karthikeyann
Add gbenchmark for nvtext ngrams functions (#7693) @davidwendt
Misc Python/Cython optimizations (#7686) @shwina
Add gbenchmark for nvtext tokenize functions (#7684) @davidwendt
Add column_device_view to orc writer (#7676) @kaatish
cudf_kafka now uses cuDF CMake export targets (CPM) (#7674) @robertmaynard
Add gbenchmark for nvtext normalize functions (#7668) @davidwendt
Resolve unnecessary import of thrust/optional.hpp in types.hpp (#7667) @vyasr
Feature/optimize accessor copy (#7660) @vyasr
Fix find_package(cudf) (#7658) @trxcllnt
Work-around for gcc7 compile error on Centos7 (#7652) @davidwendt
Add in JNI support for count_elements (#7651) @revans2
Fix issues with building cudf in a non-conda environment (#7647) @galipremsagar
Refactor ConfigureCUDA to not conditionally insert compiler flags (#7643) @robertmaynard
Add gbenchmark for converting strings to/from timestamps (#7641) @davidwendt
Handle constructing a cudf.Scalar from a cudf.Scalar (#7639) @shwina
Add in JNI support for table partition (#7637) @revans2
Add explicit fixed_point merge test (#7635) @codereport
Add JNI support for IDENTITY hash partitioning (#7626) @revans2
Java support on explode_outer (#7625) @sperlingxx
Java support of casting string from/to decimal (#7623) @sperlingxx
Convert cudf::concatenate APIs to use spans and device_uvector (#7621) @harrism
Add gbenchmark for cudf::strings::translate function (#7617) @davidwendt
Use file(COPY ) over file(INSTALL ) so cmake output is reduced (#7616) @robertmaynard
Use rmm::device_uvector in place of rmm::device_vector for ORC reader/writer and cudf::io::column_buffer (#7614) @vuule
Refactor Java host-side buffer concatenation to expose separate steps (#7610) @jlowe
Add gbenchmarks for string substrings functions (#7603) @davidwendt
Refactor string conversion check (#7599) @ttnghia
JNI: Pass names of children struct columns to native Arrow IPC writer (#7598) @firestarman
Revert "ENH Fix stale GHA and prevent duplicates " (#7595) @mike-wendt
ENH Fix stale GHA and prevent duplicates (#7594) @mike-wendt
Fix auto-detecting GPU architectures (#7593) @trxcllnt
Reduce cudf library size (#7583) @robertmaynard
Optimize cudf::make_strings_column for long strings (#7576) @davidwendt
Always build and export the cudf::cudftestutil target (#7574) @trxcllnt
Eliminate literal parameters to uvector::set_element_async and device_scalar::set_value (#7563) @harrism
Add gbenchmark for strings::concatenate (#7560) @davidwendt
Update Changelog Link (#7550) @ajschmidt8
Add gbenchmarks for strings replace regex functions (#7541) @davidwendt
Add __repr__ for Column and ColumnAccessor (#7531) @shwina
Support Decimal DIV changes in cudf (#7527) @razajafri
Remove unneeded step parameter from strings::detail::copy_slice (#7525) @davidwendt
Use device_uvector, device_span in sort groupby (#7523) @karthikeyann
Add gbenchmarks for strings extract function (#7522) @davidwendt
Rename ARROW_STATIC_LIB because it conflicts with one in FindArrow.cmake (#7518) @trxcllnt
Reduce compile time/size for scan.cu (#7516) @davidwendt
Change device_vector to device_uvector in nvtext source files (#7512) @davidwendt
Removed unneeded includes from traits.hpp (#7509) @davidwendt
FIX Remove random build directory generation for ccache (#7508) @dillon-cullinan
xfail failing pytest in pandas 1.2.3 (#7507) @galipremsagar
JNI bit cast (#7493) @revans2
Combine rolling window function tests (#7480) @mythrocks
Prepare Changelog for Automation (#7477) @ajschmidt8
Java support for explode position (#7471) @sperlingxx
Update 0.18 changelog entry (#7463) @ajschmidt8
JNI: Support skipping nulls for collect aggregation (#7457) @firestarman
Join APIs that return gathermaps (#7454) @shwina
Remove dependence on managed memory for multimap test (#7451) @jrhemstad
Use cuFile for Parquet IO when available (#7444) @vuule
Statistics cleanup (#7439) @kaatish
Add gbenchmarks for strings filter functions (#7438) @davidwendt
fixed_point + cudf::binary_operation API Changes (#7435) @codereport
Improve string gather performance (#7433) @jlowe
Don't use user resource for a temporary allocation in sort_by_key (#7431) @magnatelee
Detail APIs for datetime functions (#7430) @magnatelee
Replace thrust::max_element with thrust::reduce in strings findall_re (#7428) @davidwendt
Add gbenchmark for strings split/split_record functions (#7427) @davidwendt
Update JNI build to use CMAKE_CUDA_ARCHITECTURES (#7425) @jlowe
Change nvtext::load_vocabulary_file to return a unique ptr (#7424) @davidwendt
Simplify type dispatch with device_storage_dispatch (#7419) @codereport
Java support for casting of nested child columns (#7417) @razajafri
Improve scalar string replace performance for long strings (#7415) @jlowe
Remove unneeded temporary device vector for strings scatter specialization (#7409) @davidwendt
bitmask_or implementation with bitmask refactor (#7406) @rwlee
Add other cudf::strings::replace functions to current strings replace gbenchmark (#7403) @davidwendt
Clean up included headers in device_operators.cuh (#7401) @codereport
Move nullable index iterator to indexalator factory (#7399) @davidwendt
ENH Pass ccache variables to conda recipe & use Ninja in CI (#7398) @Ethyling
upgrade maven-antrun-plugin to support maven parallel builds (#7393) @rongou
Add gbenchmark for strings find/contains functions (#7392) @davidwendt
Use CMAKE_CUDA_ARCHITECTURES (#7391) @robertmaynard
Refactor libcudf strings::replace to use make_strings_children utility (#7384) @davidwendt
Added in JNI support for out of core sort algorithm (#7381) @revans2
Upgrade pandas to 1.2 (#7375) @galipremsagar
Rename logical_cast to bit_cast and allow additional conversions (#7373) @ttnghia
jitify 2 support (#7372) @cwharris
compile_udf: Cache PTX for similar functions (#7371) @gmarkall
Add string scalar replace benchmark (#7369) @jlowe
Add gbenchmark for strings contains_re/count_re functions (#7366) @davidwendt
Update orc reader and writer fuzz tests (#7357) @galipremsagar
Improve url_decode performance for long strings (#7353) @jlowe
cudf::ast Small Refactorings (#7352) @codereport
Remove std::cout and print in the scatter test function EmptyListsOfNullableStrings. (#7342) @ttnghia
Use cudf::detail::make_counting_transform_iterator (#7338) @codereport
Change block size parameter from a global to a template param. (#7333) @nvdbaranec
Partial clean up of ORC writer (#7324) @vuule
Add gbenchmark for cudf::strings::to_lower (#7316) @davidwendt
Update Java bindings version to 0.19-SNAPSHOT (#7307) @pxLi
Move cudf::test::make_counting_transform_iterator to cudf/detail/iterator.cuh (#7306) @codereport
Use string literals in fixed_point release_asserts (#7303) @codereport
Fix merge conflicts for #7295 (#7297) @ajschmidt8
Add UTF-8 chars to create_random_column<string_view> benchmark utility (#7292) @davidwendt
Abstracting block reduce and block scan from cuIO kernels with cub apis (#7278) @rgsl888prabhu
Build.sh use cmake --build to drive build system invocation (#7270) @robertmaynard
Refactor dictionary support for reductions any/all (#7242) @davidwendt
Replace stream.value() with stream for stream_view args (#7236) @karthikeyann
Interval index and interval_range (#7182) @marlenezw
avro reader integration tests (#7156) @cwharris
Rework libcudf CMakeLists.txt to export targets for CPM (#7107) @trxcllnt
Adding Interval Dtype (#6984) @marlenezw
Cleaning up for loops with make_(counting_)transform_iterator (#6546) @codereport

- C++
Published by GPUtester about 5 years ago

https://github.com/rapidsai/cudf - v0.18.1

- C++
Published by GPUtester about 5 years ago

https://github.com/rapidsai/cudf - [NIGHTLY] v0.18.0

🚨 Breaking Changes

Default groupby to sort=False (#7180) @isVoid
Add libcudf API for parsing of ORC statistics (#7136) @vuule
Replace ORC writer api with class (#7099) @rgsl888prabhu
Pack/unpack functionality to convert tables to and from a serialized format. (#7096) @nvdbaranec
Replace parquet writer api with class (#7058) @rgsl888prabhu
Add days check to cudf::is_timestamp using cuda::std::chrono classes (#7028) @davidwendt
Fix default parameter values of write_csv and write_parquet (#6967) @vuule
Align Series.groupby API to match Pandas (#6964) @kkraus14
Share factorize implementation with Index and cudf module (#6885) @brandon-b-miller

🐛 Bug Fixes

Fix null-bounds calculation for ranged window queries (#7568) @mythrocks
Remove incorrect std::move call on return variable (#7319) @davidwendt
Fix failing CI ORC test (#7313) @vuule
Disallow constructing frames from a ColumnAccessor (#7298) @shwina
fix java cuFile tests (#7296) @rongou
Fix style issues related to NumPy (#7279) @shwina
Fix bug when iloc slice terminates at before-the-zero position (#7277) @isVoid
Fix copying dtype metadata after calling libcudf functions (#7271) @shwina
Move lists utility function definition out of header (#7266) @mythrocks
Throw if bool column would cause incorrect result when writing to ORC (#7261) @vuule
Use uvector in replace_nulls; Fix sort_helper::grouped_value doc (#7256) @isVoid
Remove floating point types from cudf::sort fast-path (#7250) @davidwendt
Disallow picking output columns from nested columns. (#7248) @devavret
Fix loc for Series with a MultiIndex (#7243) @shwina
Fix Arrow column test leaks (#7241) @tgravescs
Fix test column vector leak (#7238) @kuhushukla
Fix some bugs in java scalar support for decimal (#7237) @revans2
Improve assert_eq handling of scalar (#7220) @isVoid
Fix missing null_count() comparison in test framework and related failures (#7219) @nvdbaranec
Remove floating point types from radix sort fast-path (#7215) @davidwendt
Fixing parquet benchmarks (#7214) @rgsl888prabhu
Handle various parameter combinations in replace API (#7207) @galipremsagar
Export mock aws credentials for s3 tests (#7176) @ayushdg
Add MultiIndex.rename API (#7172) @isVoid
Fix importing list & struct types in from_arrow (#7162) @galipremsagar
Fixing parquet precision writing failing if scale is equal to precision (#7146) @hyperbolic2346
Update s3 tests to use moto_server (#7144) @ayushdg
Fix JIT cache multi-process test flakiness in slow drives (#7142) @devavret
Fix compilation errors in libcudf (#7138) @galipremsagar
Fix compilation failure caused by -Wall addition. (#7134) @codereport
Add informative error message for sep in CSV writer (#7095) @galipremsagar
Add JIT cache per compute capability (#7090) @devavret
Implement __hash__ method for ListDtype (#7081) @galipremsagar
Only upload packages that were built (#7077) @raydouglass
Fix comparisons between Series and cudf.NA (#7072) @brandon-b-miller
Handle nan values correctly in Series.one_hot_encoding (#7059) @galipremsagar
Add unstack() support for non-multiindexed dataframes (#7054) @isVoid
Fix read_orc for decimal type (#7034) @rgsl888prabhu
Fix backward compatibility of loading a 0.16 pkl file (#7033) @galipremsagar
Decimal casts in JNI became a NOOP (#7032) @revans2
Restore usual instance/subclass checking to cudf.DateOffset (#7029) @shwina
Add days check to cudf::is_timestamp using cuda::std::chrono classes (#7028) @davidwendt
Fix to_csv delimiter handling of timestamp format (#7023) @davidwendt
Pin librdkafka to gcc 7 compatible version (#7021) @raydouglass
Fix fillna & dropna to also consider np.nan as a missing value (#7019) @galipremsagar
Fix round operator's HALF_EVEN computation for negative integers (#7014) @nartal1
Skip Thrust sort patch if already applied (#7009) @harrism
Fix cudf::hash_partition for decimal32 and decimal64 (#7006) @codereport
Fix Thrust unroll patch command (#7002) @harrism
Fix loc behaviour when key of incorrect type is used (#6993) @shwina
Fix int to datetime conversion in csv_read (#6991) @kaatish
fix excluding cufile tests by default (#6988) @rongou
Fix java cufile tests when cufile is not installed (#6987) @revans2
Make cudf::round for fixed_point when scale = -decimal_places a no-op (#6975) @codereport
Fix type comparison for java (#6970) @revans2
Fix default parameter values of write_csv and write_parquet (#6967) @vuule
Align Series.groupby API to match Pandas (#6964) @kkraus14
Fix timestamp parsing in ORC reader for timezones without transitions (#6959) @vuule
Fix typo in numerical.py (#6957) @rgsl888prabhu
fixed_point_value double-shifts in fixed_point construction (#6950) @codereport
fix libcu++ include path for jni (#6948) @rongou
Fix groupby agg/apply behaviour when no key columns are provided (#6945) @shwina
Avoid inserting null elements into join hash table when nulls are treated as unequal (#6943) @hyperbolic2346
Fix cudf::merge gtest for dictionary columns (#6942) @davidwendt
Pass numeric scalars of the same dtype through numeric binops (#6938) @brandon-b-miller
Fix N/A detection for empty fields in CSV reader (#6922) @vuule
Fix rmm_mode=managed parameter for gtests (#6912) @davidwendt
Fix nullmask offset handling in parquet and orc writer (#6889) @kaatish
Correct the sampling range when sampling with replacement (#6884) @ChrisJar
Handle nested string columns with no children in contiguous_split. (#6864) @nvdbaranec
Fix columns & index handling in dataframe constructor (#6838) @galipremsagar

📖 Documentation

Update readme (#7318) @shwina
Fix typo in cudf.core.column.string.extract docs (#7253) @adelevie
Update doxyfile project number (#7161) @davidwendt
Update 10 minutes to cuDF and CuPy with new APIs (#7158) @ChrisJar
Cross link RMM & libcudf Doxygen docs (#7149) @ajschmidt8
Add documentation for support dtypes in all IO formats (#7139) @galipremsagar
Add groupby docs (#7100) @shwina
Update cudf python docstrings with new null representation (<NA>) (#7050) @galipremsagar
Make Doxygen comments formatting consistent (#7041) @vuule
Add docs for working with missing data (#7010) @galipremsagar
Remove warning in from_dlpack and to_dlpack methods (#7001) @miguelusque
libcudf Developer Guide (#6977) @harrism
Add JNI wrapper for the cuFile API (GDS) (#6940) @rongou

🚀 New Features

Support numeric_only field for rank() (#7213) @isVoid
Add support for cudf::binary_operation TRUE_DIV for decimal32 and decimal64 (#7198) @codereport
Implement COLLECT rolling window aggregation (#7189) @mythrocks
Add support for array-like inputs in cudf.get_dummies (#7181) @galipremsagar
Default groupby to sort=False (#7180) @isVoid
Add libcudf lists column count_elements API (#7173) @davidwendt
Implement cudf::group_by (sort) for decimal32 and decimal64 (#7169) @codereport
Add encoding and compression argument to CSV writer (#7168) @VibhuJawa
cudf::rolling_window SUM support for decimal32 and decimal64 (#7147) @codereport
Adding support for explode to cuDF (#7140) @hyperbolic2346
Add libcudf API for parsing of ORC statistics (#7136) @vuule
update GDS/cuFile location for 0.9 release (#7131) @rongou
Add Segmented sort (#7122) @karthikeyann
Add cudf::binary_operation NULL_MIN, NULL_MAX & NULL_EQUALS for decimal32 and decimal64 (#7119) @codereport
Add scale and value methods to fixed_point (#7109) @codereport
Replace ORC writer api with class (#7099) @rgsl888prabhu
Pack/unpack functionality to convert tables to and from a serialized format. (#7096) @nvdbaranec
Improve digitize API (#7071) @isVoid
Add List types support in data generator (#7064) @galipremsagar
cudf::scan support for decimal32 and decimal64 (#7063) @codereport
cudf::rolling ROW_NUMBER support for decimal32 and decimal64 (#7061) @codereport
Replace parquet writer api with class (#7058) @rgsl888prabhu
Support contains() on lists of primitives (#7039) @mythrocks
Implement cudf::rolling for decimal32 and decimal64 (#7037) @codereport
Add ffill and bfill to string columns (#7036) @isVoid
Enable round in cudf for DataFrame and Series (#7022) @ChrisJar
Extend replace_nulls_policy to string and dictionary type (#7004) @isVoid
Add segmented_gather(list_column, gather_list) (#7003) @karthikeyann
Add method field to fillna for fixed width columns (#6998) @isVoid
Manual merge of branch 0.17 into branch 0.18 (#6995) @shwina
Implement cudf::reduce for decimal32 and decimal64 (part 2) (#6980) @codereport
Add Ufunc alias look up for appropriate numpy ufunc dispatching (#6973) @VibhuJawa
Add pytest-xdist to dev environment.yml (#6958) @galipremsagar
Add Index.set_names api (#6929) @galipremsagar
Add replace_null API with replace_policy parameter, fixed_width column support (#6907) @isVoid
Share factorize implementation with Index and cudf module (#6885) @brandon-b-miller
Implement update() function (#6883) @skirui-source
Add groupby idxmin, idxmax aggregation (#6856) @karthikeyann
Implement cudf::reduce for decimal32 and decimal64 (part 1) (#6814) @codereport
Implement cudf.DateOffset for months (#6775) @brandon-b-miller
Add Python DecimalColumn (#6715) @shwina
Add dictionary support to libcudf groupby functions (#6585) @davidwendt

🛠️ Improvements

Update stale GHA with exemptions & new labels (#7395) @mike-wendt
Add GHA to mark issues/prs as stale/rotten (#7388) @Ethyling
Unpin from numpy < 1.20 (#7335) @shwina
Prepare Changelog for Automation (#7309) @galipremsagar
Prepare Changelog for Automation (#7272) @ajschmidt8
Add JNI support for converting Arrow buffers to CUDF ColumnVectors (#7222) @tgravescs
Add coverage for skiprows and num_rows in parquet reader fuzz testing (#7216) @galipremsagar
Define and implement more behavior for merging on categorical variables (#7209) @brandon-b-miller
Add CudfSeriesGroupBy to optimize dask_cudf groupby-mean (#7194) @rjzamora
Add dictionary column support to rolling_window (#7186) @davidwendt
Modify the semantics of end pointers in cuIO to match standard library (#7179) @vuule
Adding unit tests for fixed_point with extremely large scales (#7178) @codereport
Fast path single column sort (#7167) @davidwendt
Fix -Werror=sign-compare errors in device code (#7164) @trxcllnt
Refactor cudf::string_view host and device code (#7159) @davidwendt
Enable logic for GPU auto-detection in cudfjni (#7155) @gerashegalov
Java bindings for Fixed-point type support for Parquet (#7153) @razajafri
Add Java interface for the new API 'explode' (#7151) @firestarman
Replace offsets with iterators in cuIO utilities and CSV parser (#7150) @vuule
Add gbenchmarks for reduction aggregations any() and all() (#7129) @davidwendt
Update JNI for contiguous_split packed results (#7127) @jlowe
Add JNI and Java bindings for list_contains (#7125) @kuhushukla
Add Java unit tests for window aggregate 'collect' (#7121) @firestarman
verify window operations on decimal with java tests (#7120) @sperlingxx
Adds in JNI support for creating a list column from existing columns (#7112) @revans2
Build libcudf with -Wall (#7105) @trxcllnt
Add column_device_view pointers to EncColumnDesc (#7097) @kaatish
Add pyorc to dev environment (#7085) @galipremsagar
JNI support for creating struct column from existing columns and fixed bug in struct with no children (#7084) @revans2
Fastpath single strings column in cudf::sort (#7075) @davidwendt
Upgrade nvcomp to 1.2.1 (#7069) @rongou
Refactor ORC ProtobufReader to make it more extendable (#7055) @vuule
Add Java tests for decimal casts (#7051) @sperlingxx
Auto-label PRs based on their content (#7044) @jolorunyomi
Create sort gbenchmark for strings column (#7040) @davidwendt
Refactor io memory fetches to use hostdevice_vector methods (#7035) @ChrisJar
Spark Murmur3 hash functionality (#7024) @rwlee
Fix libcudf strings logic where size_type is used to access INT32 column data (#7020) @davidwendt
Adding decimal writing support to parquet (#7017) @hyperbolic2346
Add compression="infer" as default for dask_cudf.read_csv (#7013) @rjzamora
Correct ORC docstring; other minor cuIO improvements (#7012) @vuule
Reduce number of hostdevice_vector allocations in parquet reader (#7005) @devavret
Check output size overflow on strings gather (#6997) @davidwendt
Improve representation of MultiIndex (#6992) @galipremsagar
Disable some pragma unroll statements in thrust sort.h (#6982) @davidwendt
Minor cudf::round internal refactoring (#6976) @codereport
Add Java bindings for URL conversion (#6972) @jlowe
Enable strict_decimal_types in parquet reading (#6969) @sperlingxx
Add in basic support to JNI for logical_cast (#6954) @revans2
Remove duplicate file array_tests.cpp (#6953) @karthikeyann
Add null mask fixed_point_column_wrapper constructors (#6951) @codereport
Update Java bindings version to 0.18-SNAPSHOT (#6949) @jlowe
Use simplified rmm::exec_policy (#6939) @harrism
Add null count test for apply_boolean_mask (#6903) @harrism
Implement DataFrame.quantile for datetime and timedelta data types (#6902) @ChrisJar
Remove **kwargs from string/categorical methods (#6750) @shwina
Refactor rolling.cu to reduce compile time (#6512) @mythrocks
Add static type checking via Mypy (#6381) @shwina
Update to official libcu++ on Github (#6275) @trxcllnt

- C++
Published by rapids-bot[bot] about 5 years ago

https://github.com/rapidsai/cudf - v0.18.0

Breaking Changes 🚨

Default groupby to sort=False (#7180) @isVoid
Add libcudf API for parsing of ORC statistics (#7136) @vuule
Replace ORC writer api with class (#7099) @rgsl888prabhu
Pack/unpack functionality to convert tables to and from a serialized format. (#7096) @nvdbaranec
Replace parquet writer api with class (#7058) @rgsl888prabhu
Add days check to cudf::is_timestamp using cuda::std::chrono classes (#7028) @davidwendt
Fix default parameter values of write_csv and write_parquet (#6967) @vuule
Align Series.groupby API to match Pandas (#6964) @kkraus14
Share factorize implementation with Index and cudf module (#6885) @brandon-b-miller

Bug Fixes 🐛

Remove incorrect std::move call on return variable (#7319) @davidwendt
Fix failing CI ORC test (#7313) @vuule
Disallow constructing frames from a ColumnAccessor (#7298) @shwina
fix java cuFile tests (#7296) @rongou
Fix style issues related to NumPy (#7279) @shwina
Fix bug when iloc slice terminates at before-the-zero position (#7277) @isVoid
Fix copying dtype metadata after calling libcudf functions (#7271) @shwina
Move lists utility function definition out of header (#7266) @mythrocks
Throw if bool column would cause incorrect result when writing to ORC (#7261) @vuule
Use uvector in replace_nulls; Fix sort_helper::grouped_value doc (#7256) @isVoid
Remove floating point types from cudf::sort fast-path (#7250) @davidwendt
Disallow picking output columns from nested columns. (#7248) @devavret
Fix loc for Series with a MultiIndex (#7243) @shwina
Fix Arrow column test leaks (#7241) @tgravescs
Fix test column vector leak (#7238) @kuhushukla
Fix some bugs in java scalar support for decimal (#7237) @revans2
Improve assert_eq handling of scalar (#7220) @isVoid
Fix missing null_count() comparison in test framework and related failures (#7219) @nvdbaranec
Remove floating point types from radix sort fast-path (#7215) @davidwendt
Fixing parquet benchmarks (#7214) @rgsl888prabhu
Handle various parameter combinations in replace API (#7207) @galipremsagar
Export mock aws credentials for s3 tests (#7176) @ayushdg
Add MultiIndex.rename API (#7172) @isVoid
Fix importing list & struct types in from_arrow (#7162) @galipremsagar
Fixing parquet precision writing failing if scale is equal to precision (#7146) @hyperbolic2346
Update s3 tests to use moto_server (#7144) @ayushdg
Fix JIT cache multi-process test flakiness in slow drives (#7142) @devavret
Fix compilation errors in libcudf (#7138) @galipremsagar
Fix compilation failure caused by -Wall addition. (#7134) @codereport
Add informative error message for sep in CSV writer (#7095) @galipremsagar
Add JIT cache per compute capability (#7090) @devavret
Implement __hash__ method for ListDtype (#7081) @galipremsagar
Only upload packages that were built (#7077) @raydouglass
Fix comparisons between Series and cudf.NA (#7072) @brandon-b-miller
Handle nan values correctly in Series.one_hot_encoding (#7059) @galipremsagar
Add unstack() support for non-multiindexed dataframes (#7054) @isVoid
Fix read_orc for decimal type (#7034) @rgsl888prabhu
Fix backward compatibility of loading a 0.16 pkl file (#7033) @galipremsagar
Decimal casts in JNI became a NOOP (#7032) @revans2
Restore usual instance/subclass checking to cudf.DateOffset (#7029) @shwina
Add days check to cudf::is_timestamp using cuda::std::chrono classes (#7028) @davidwendt
Fix to_csv delimiter handling of timestamp format (#7023) @davidwendt
Pin librdkafka to gcc 7 compatible version (#7021) @raydouglass
Fix fillna & dropna to also consider np.nan as a missing value (#7019) @galipremsagar
Fix round operator's HALF_EVEN computation for negative integers (#7014) @nartal1
Skip Thrust sort patch if already applied (#7009) @harrism
Fix cudf::hash_partition for decimal32 and decimal64 (#7006) @codereport
Fix Thrust unroll patch command (#7002) @harrism
Fix loc behaviour when key of incorrect type is used (#6993) @shwina
Fix int to datetime conversion in csv_read (#6991) @kaatish
fix excluding cufile tests by default (#6988) @rongou
Fix java cufile tests when cufile is not installed (#6987) @revans2
Make cudf::round for fixed_point when scale = -decimal_places a no-op (#6975) @codereport
Fix type comparison for java (#6970) @revans2
Fix default parameter values of write_csv and write_parquet (#6967) @vuule
Align Series.groupby API to match Pandas (#6964) @kkraus14
Fix timestamp parsing in ORC reader for timezones without transitions (#6959) @vuule
Fix typo in numerical.py (#6957) @rgsl888prabhu
fixed_point_value double-shifts in fixed_point construction (#6950) @codereport
fix libcu++ include path for jni (#6948) @rongou
Fix groupby agg/apply behaviour when no key columns are provided (#6945) @shwina
Avoid inserting null elements into join hash table when nulls are treated as unequal (#6943) @hyperbolic2346
Fix cudf::merge gtest for dictionary columns (#6942) @davidwendt
Pass numeric scalars of the same dtype through numeric binops (#6938) @brandon-b-miller
Fix N/A detection for empty fields in CSV reader (#6922) @vuule
Fix rmm_mode=managed parameter for gtests (#6912) @davidwendt
Fix nullmask offset handling in parquet and orc writer (#6889) @kaatish
Correct the sampling range when sampling with replacement (#6884) @ChrisJar
Handle nested string columns with no children in contiguous_split. (#6864) @nvdbaranec
Fix columns & index handling in dataframe constructor (#6838) @galipremsagar
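
The HALF_EVEN entry in the list above (#7014) concerns rounding negative integers to a negative number of decimal places, that is, to tens, hundreds, and so on. The sketch below only illustrates the rounding rule itself, using Python's standard decimal module. It is not cudf code, the helper name round_half_even is hypothetical, and it makes no claim about how the original bug manifested.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def round_half_even(value: int, decimal_places: int) -> int:
    """Round an integer using the half-to-even (banker's) rule.

    A negative decimal_places rounds to tens (-1), hundreds (-2), etc.
    Hypothetical helper for illustration only; not part of cudf.
    """
    if decimal_places >= 0:
        # Integers have no fractional digits, so there is nothing to round.
        return value
    # Build the rounding quantum, e.g. 1E+1 for decimal_places = -1.
    quantum = Decimal(1).scaleb(-decimal_places)
    return int(Decimal(value).quantize(quantum, rounding=ROUND_HALF_EVEN))


if __name__ == "__main__":
    for v in (-25, -35, 25, 35):
        print(v, "->", round_half_even(v, -1))
```

Under half-to-even, ties go to the nearest even multiple regardless of sign, so -25 and 25 round toward -20 and 20, while -35 and 35 round away to -40 and 40.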

Documentation 📖

Update readme (#7318) @shwina
Fix typo in cudf.core.column.string.extract docs (#7253) @adelevie
Update doxyfile project number (#7161) @davidwendt
Update 10 minutes to cuDF and CuPy with new APIs (#7158) @ChrisJar
Cross link RMM & libcudf Doxygen docs (#7149) @ajschmidt8
Add documentation for support dtypes in all IO formats (#7139) @galipremsagar
Add groupby docs (#7100) @shwina
Update cudf python docstrings with new null representation (<NA>) (#7050) @galipremsagar
Make Doxygen comments formatting consistent (#7041) @vuule
Add docs for working with missing data (#7010) @galipremsagar
Remove warning in from_dlpack and to_dlpack methods (#7001) @miguelusque
libcudf Developer Guide (#6977) @harrism
Add JNI wrapper for the cuFile API (GDS) (#6940) @rongou

New Features 🚀

Support numeric_only field for rank() (#7213) @isVoid
Add support for cudf::binary_operation TRUE_DIV for decimal32 and decimal64 (#7198) @codereport
Implement COLLECT rolling window aggregation (#7189) @mythrocks
Add support for array-like inputs in cudf.get_dummies (#7181) @galipremsagar
Default groupby to sort=False (#7180) @isVoid
Add libcudf lists column count_elements API (#7173) @davidwendt
Implement cudf::group_by (sort) for decimal32 and decimal64 (#7169) @codereport
Add encoding and compression argument to CSV writer (#7168) @VibhuJawa
cudf::rolling_window SUM support for decimal32 and decimal64 (#7147) @codereport
Adding support for explode to cuDF (#7140) @hyperbolic2346
Add libcudf API for parsing of ORC statistics (#7136) @vuule
update GDS/cuFile location for 0.9 release (#7131) @rongou
Add Segmented sort (#7122) @karthikeyann
Add cudf::binary_operation NULL_MIN, NULL_MAX & NULL_EQUALS for decimal32 and decimal64 (#7119) @codereport
Add scale and value methods to fixed_point (#7109) @codereport
Replace ORC writer api with class (#7099) @rgsl888prabhu
Pack/unpack functionality to convert tables to and from a serialized format. (#7096) @nvdbaranec
Improve digitize API (#7071) @isVoid
Add List types support in data generator (#7064) @galipremsagar
cudf::scan support for decimal32 and decimal64 (#7063) @codereport
cudf::rolling ROW_NUMBER support for decimal32 and decimal64 (#7061) @codereport
Replace parquet writer api with class (#7058) @rgsl888prabhu
Support contains() on lists of primitives (#7039) @mythrocks
Implement cudf::rolling for decimal32 and decimal64 (#7037) @codereport
Add ffill and bfill to string columns (#7036) @isVoid
Enable round in cudf for DataFrame and Series (#7022) @ChrisJar
Extend replace_nulls_policy to string and dictionary type (#7004) @isVoid
Add segmented_gather(list_column, gather_list) (#7003) @karthikeyann
Add method field to fillna for fixed width columns (#6998) @isVoid
Manual merge of branch 0.17 into branch 0.18 (#6995) @shwina
Implement cudf::reduce for decimal32 and decimal64 (part 2) (#6980) @codereport
Add Ufunc alias look up for appropriate numpy ufunc dispatching (#6973) @VibhuJawa
Add pytest-xdist to dev environment.yml (#6958) @galipremsagar
Add Index.set_names api (#6929) @galipremsagar
Add replace_null API with replace_policy parameter, fixed_width column support (#6907) @isVoid
Share factorize implementation with Index and cudf module (#6885) @brandon-b-miller
Implement update() function (#6883) @skirui-source
Add groupby idxmin, idxmax aggregation (#6856) @karthikeyann
Implement cudf::reduce for decimal32 and decimal64 (part 1) (#6814) @codereport
Implement cudf.DateOffset for months (#6775) @brandon-b-miller
Add Python DecimalColumn (#6715) @shwina
Add dictionary support to libcudf groupby functions (#6585) @davidwendt

Improvements 🛠️

Update stale GHA with exemptions & new labels (#7395) @mike-wendt
Add GHA to mark issues/prs as stale/rotten (#7388) @Ethyling
Unpin from numpy < 1.20 (#7335) @shwina
Prepare Changelog for Automation (#7309) @galipremsagar
Prepare Changelog for Automation (#7272) @ajschmidt8
Add JNI support for converting Arrow buffers to CUDF ColumnVectors (#7222) @tgravescs
Add coverage for skiprows and num_rows in parquet reader fuzz testing (#7216) @galipremsagar
Define and implement more behavior for merging on categorical variables (#7209) @brandon-b-miller
Add CudfSeriesGroupBy to optimize dask_cudf groupby-mean (#7194) @rjzamora
Add dictionary column support to rolling_window (#7186) @davidwendt
Modify the semantics of end pointers in cuIO to match standard library (#7179) @vuule
Adding unit tests for fixed_point with extremely large scales (#7178) @codereport
Fast path single column sort (#7167) @davidwendt
Fix -Werror=sign-compare errors in device code (#7164) @trxcllnt
Refactor cudf::string_view host and device code (#7159) @davidwendt
Enable logic for GPU auto-detection in cudfjni (#7155) @gerashegalov
Java bindings for Fixed-point type support for Parquet (#7153) @razajafri
Add Java interface for the new API 'explode' (#7151) @firestarman
Replace offsets with iterators in cuIO utilities and CSV parser (#7150) @vuule
Add gbenchmarks for reduction aggregations any() and all() (#7129) @davidwendt
Update JNI for contiguous_split packed results (#7127) @jlowe
Add JNI and Java bindings for list_contains (#7125) @kuhushukla
Add Java unit tests for window aggregate 'collect' (#7121) @firestarman
verify window operations on decimal with java tests (#7120) @sperlingxx
Adds in JNI support for creating a list column from existing columns (#7112) @revans2
Build libcudf with -Wall (#7105) @trxcllnt
Add column_device_view pointers to EncColumnDesc (#7097) @kaatish
Add pyorc to dev environment (#7085) @galipremsagar
JNI support for creating struct column from existing columns and fixed bug in struct with no children (#7084) @revans2
Fastpath single strings column in cudf::sort (#7075) @davidwendt
Upgrade nvcomp to 1.2.1 (#7069) @rongou
Refactor ORC ProtobufReader to make it more extendable (#7055) @vuule
Add Java tests for decimal casts (#7051) @sperlingxx
Auto-label PRs based on their content (#7044) @jolorunyomi
Create sort gbenchmark for strings column (#7040) @davidwendt
Refactor io memory fetches to use hostdevice_vector methods (#7035) @ChrisJar
Spark Murmur3 hash functionality (#7024) @rwlee
Fix libcudf strings logic where size_type is used to access INT32 column data (#7020) @davidwendt
Adding decimal writing support to parquet (#7017) @hyperbolic2346
Add compression="infer" as default for dask_cudf.read_csv (#7013) @rjzamora
Correct ORC docstring; other minor cuIO improvements (#7012) @vuule
Reduce number of hostdevice_vector allocations in parquet reader (#7005) @devavret
Check output size overflow on strings gather (#6997) @davidwendt
Improve representation of MultiIndex (#6992) @galipremsagar
Disable some pragma unroll statements in thrust sort.h (#6982) @davidwendt
Minor cudf::round internal refactoring (#6976) @codereport
Add Java bindings for URL conversion (#6972) @jlowe
Enable strict_decimal_types in parquet reading (#6969) @sperlingxx
Add in basic support to JNI for logical_cast (#6954) @revans2
Remove duplicate file array_tests.cpp (#6953) @karthikeyann
Add null mask fixed_point_column_wrapper constructors (#6951) @codereport
Update Java bindings version to 0.18-SNAPSHOT (#6949) @jlowe
Use simplified rmm::exec_policy (#6939) @harrism
Add null count test for apply_boolean_mask (#6903) @harrism
Implement DataFrame.quantile for datetime and timedelta data types (#6902) @ChrisJar
Remove **kwargs from string/categorical methods (#6750) @shwina
Refactor rolling.cu to reduce compile time (#6512) @mythrocks
Add static type checking via Mypy (#6381) @shwina
Update to official libcu++ on Github (#6275) @trxcllnt

- C++
Published by GPUtester over 5 years ago

https://github.com/rapidsai/cudf - v0.17.0

v0.17.0 Release

- C++
Published by GPUtester over 5 years ago

https://github.com/rapidsai/cudf - v0.16.0

v0.16.0 Release

- C++
Published by GPUtester over 5 years ago

https://github.com/rapidsai/cudf - v0.15.0

v0.15.0 Release

- C++
Published by raydouglass over 5 years ago