
Replace custom size kernel with cub::DeviceReduce::TransformReduce #811

Open
PointKernel wants to merge 1 commit into NVIDIA:dev from PointKernel:replace-size-kernel-with-cub-sum


@PointKernel
Member

This PR replaces the custom size kernel with cub::DeviceReduce::TransformReduce. static_map::size() is faster at every capacity measured, and the gain grows with capacity: from 1.08x at 2,000,000 slots to 2.83x at 1,000,000,000 slots.
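For readers unfamiliar with the pattern, here is a minimal host-side sketch of what a transform-reduce computes when used to count stored keys. It is not the code in this PR: it uses std::transform_reduce on the CPU as an analogue of the device-side call, and the names `kEmptyKey`, `is_filled`, and `count_filled` as well as the sentinel value are assumptions made for illustration only.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical sentinel marking an empty slot (assumption; the real map
// lets the user choose the sentinel).
constexpr std::int64_t kEmptyKey = -1;

// Transform step: map each slot's key to 1 if the slot is filled, else 0.
struct is_filled {
  std::size_t operator()(std::int64_t key) const {
    return key != kEmptyKey ? 1u : 0u;
  }
};

// Reduce step: sum the per-slot flags to obtain the number of stored keys.
std::size_t count_filled(const std::vector<std::int64_t>& slots) {
  return std::transform_reduce(slots.begin(), slots.end(),
                               std::size_t{0},   // initial value of the sum
                               std::plus<>{},    // reduction operator
                               is_filled{});     // transform operator
}
```

Calling `count_filled` on slots `{7, kEmptyKey, 42, kEmptyKey, kEmptyKey, 3}` counts the three non-sentinel keys. The device-side version would express the same two operators (a per-slot predicate and a sum) and leave scheduling and tuning to the library rather than to a hand-written kernel.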

Microbenchmark timing of static_map::size() only, on an NVIDIA GH200 480GB with CUDA 13.1 and GCC 14.3. The map is populated with NumInputs unique I64 keys at 0.5 occupancy, so capacity = 2 × NumInputs.

| Capacity (slots) | Before (custom kernel) | After (cub::DeviceReduce::TransformReduce) | Speedup |
|---|---|---|---|
| 2,000,000 | 47.03 us | 43.74 us | 1.08x |
| 20,000,000 | 232.11 us | 122.31 us | 1.90x |
| 200,000,000 | 2,360 us | 869.70 us | 2.71x |
| 1,000,000,000 | 11,927 us | 4,212 us | 2.83x |
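The Speedup column is the "before" time divided by the "after" time, rounded to two decimals. A small sketch for re-deriving it from the timings above (the helper names `speedup` and `round2` are illustrative, not part of the PR):

```cpp
#include <cmath>

// Speedup of the new implementation: time before divided by time after.
double speedup(double before_us, double after_us) {
  return before_us / after_us;
}

// Round to two decimals, matching the precision used in the table.
double round2(double x) {
  return std::round(x * 100.0) / 100.0;
}
```

For example, `round2(speedup(11927.0, 4212.0))` reproduces the 2.83x entry for the 1,000,000,000-slot row.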

PointKernel requested a review from sleeepyjack as a code owner May 1, 2026 19:10
PointKernel added the type: improvement and topic: performance labels May 1, 2026

Labels

topic: performance (Performance related issue)
type: improvement (Improvement / enhancement to an existing function)
