Feature #22011: Hash tables with swiss table

Added by dsh0416 (Delton Ding) 11 days ago. Updated 9 days ago.

Status:
Open
Assignee:
-
Target version:
-
[ruby-core:125336]

Description

This change adds a Swiss-table-inspired probing layer to Ruby's core st_table, and shrinks st_table_entry from 24 B to 16 B by moving the stored hash into a parallel array. It is built and enabled by default; --disable-swiss-st reverts to the original st.c. The public ABI of struct st_table and the iteration-order guarantee are preserved.

Motivation

Hashes are everywhere in Ruby — instance-variable tables, ivar shapes, constant tables, JSON/HTTP/AR rows, every params, every to_h. Profiles of Rails-shaped workloads spend a meaningful fraction of CPU inside st_lookup and st_insert. Two pain points stood out in the upstream implementation:

  1. Probe loops are branch-heavy. Every step of the perturb chain loads a bin, fetches the entry it points to, compares the full st_hash_t (8 B) and only then calls eql?. On a miss that is several dependent loads per probe with no way to fast-reject groups of slots in parallel.
  2. st_table_entry is 24 B. The (hash, key, record) triple gets one cache line per ~2.5 entries. Iteration and equality scans burn through L1 quickly, and Ruby programs typically hold a lot of small-to-mid-sized hashes (so per-table overhead matters).

The Swiss-table family of designs (Abseil flat_hash_map, Rust hashbrown) addresses (1) with a 1-byte-per-slot control array that lets a single SIMD/SWAR comparison reject or short-list 8 slots at once. We borrow that idea but keep Ruby's two-array layout (so we don't break ABI or insertion order) and add a third, parallel ctrl[] byte array. We then attack (2) by also extracting the hash field out of st_table_entry into its own parallel uint32_t hashes[] array.

Changes

  1. Adds ST_USE_SWISS_BINS, enabled under RUBY_EXPORT and disabled otherwise, so parser_st.c keeps the old, compatible layout.
  2. Removes the stored hash field from st_table_entry in the Swiss path and stores hashes in a side array after entries, using 32-bit stored hashes.
  3. Introduces packed Swiss bin groups: 8 control bytes plus 8 bin indexes per group.
  4. Adds Swiss probing via control-byte matching, using 7-bit h2 fingerprints for candidate filtering.
  5. Packs bins after entries when possible, reducing separate allocation pressure.
  6. Updates allocation, copy, free, memsize, rebuild, rehash, insert, delete, shift, lookup, foreach, keys, and values paths to go through hash/bin abstraction macros.
  7. Adjusts rebuild thresholds for Swiss load factor, using roughly 7/8 bin occupancy.
  8. Adds tombstone-triggered rebuild behavior when many deleted slots accumulate.

Analysis

Earlier preliminary analysis showed large improvements on microbenchmarks after adopting ideas from Swiss tables, but those gains do not fully carry over to real-world benchmarks.

Swiss-table-style probing can look huge in isolated lookup/insert microbenchmarks because the hot loop is tight, the table stays cache-hot, and control-byte filtering avoids many full entry/key comparisons.

In real Ruby workloads, the hash table operations are mixed with object allocation, method dispatch, GC barriers, string/hash callbacks, comparisons, branchy VM work, etc. That weakens the “everything is in L1” assumption.

If the old table is kept at < 0.5 load factor, collision chains/probe lengths are already short, so a simpler open-addressing scheme can be very competitive because it has smaller code and fewer moving parts.

The larger practical win of the Swiss-style design is probably memory density: it sustains a much higher load factor, around 7/8, without probe behavior collapsing, while staying competitive in speed. That saves more memory than the introduced ctrl[] array costs.


Files

output_001.txt (5.27 KB) — dsh0416 (Delton Ding), 04/23/2026 10:39 AM
output_002.txt (5.29 KB) — dsh0416 (Delton Ding), 04/23/2026 10:39 AM
memory_top_rss_changes.png (111 KB) — dsh0416 (Delton Ding), 04/24/2026 03:43 PM
time_by_test_violin_interp.png (613 KB) — dsh0416 (Delton Ding), 04/24/2026 03:43 PM
time_by_test_violin_yjit.png (662 KB) — dsh0416 (Delton Ding), 04/24/2026 03:43 PM
changes.patch (43.2 KB) — dsh0416 (Delton Ding), 04/24/2026 04:02 PM
memory_top_rss_changes.png (112 KB) — dsh0416 (Delton Ding), 04/25/2026 12:59 AM
time_by_test_violin_interp.png (604 KB) — dsh0416 (Delton Ding), 04/25/2026 12:59 AM
time_by_test_violin_yjit.png (652 KB) — dsh0416 (Delton Ding), 04/25/2026 12:59 AM
memory_top_rss_changes.png (112 KB) — dsh0416 (Delton Ding), 04/25/2026 01:00 AM
time_by_test_violin_interp.png (604 KB) — dsh0416 (Delton Ding), 04/25/2026 01:00 AM
time_by_test_violin_yjit.png (652 KB) — dsh0416 (Delton Ding), 04/25/2026 01:00 AM
changes.patch (42.3 KB) — dsh0416 (Delton Ding), 04/25/2026 01:00 AM
clipboard-202604251344-sv2ro.png (72.1 KB) — dsh0416 (Delton Ding), 04/25/2026 04:44 AM

Updated by dsh0416 (Delton Ding) 11 days ago Actions #1

  • Description updated (diff)

Updated by dsh0416 (Delton Ding) 11 days ago Actions #2

  • Description updated (diff)

Updated by dsh0416 (Delton Ding) 11 days ago Actions #3

  • Description updated (diff)

Updated by dsh0416 (Delton Ding) 11 days ago Actions #4 [ruby-core:125337]

  • File swiss-hash.patch added
  • File deleted (patch.diff)

fix some regression bugs

Updated by dsh0416 (Delton Ding) 11 days ago Actions #6

  • File swiss-hash.patch added
  • File deleted (swiss-hash.patch)

Updated by dsh0416 (Delton Ding) 10 days ago Actions #7 [ruby-core:125340]

I found some crash cases that I need to fix first. I am also running the full ruby-bench suite against the master branch on Linux and Windows to see if I've missed anything.

Updated by byroot (Jean Boussier) 10 days ago Actions #8 [ruby-core:125345]

Just a side note: enlarging struct st_table by 16B has consequences for various object sizes (e.g. Hash goes from pool 80 to pool 96, and a bunch of other structs embed struct st_table).

I had a vague plan to colocate the bins and entries buffers like I did with set_table (https://github.com/ruby/ruby/commit/85c52079aa35a1d2e063a5b40eebe91701c8cb9e). If we end up with two more buffers, that becomes more important.

Updated by dsh0416 (Delton Ding) 10 days ago Actions #9

  • File deleted (swiss-hash.patch)

Updated by dsh0416 (Delton Ding) 10 days ago Actions #10

  • File changes.patch added

Updated by dsh0416 (Delton Ding) 10 days ago · Edited Actions #11 [ruby-core:125348]

I see that the benchmarks might be affected: the st_numhash/symbol hash path appears to return a constant 128 in bits 25..31, which increases hash collisions. Investigating.

Updated by dsh0416 (Delton Ding) 9 days ago · Edited Actions #12 [ruby-core:125350]

Here are the ruby benches results.


Area                           interp          yjit
Time geomean                   +1.77% slower   -0.17% faster
Time median benchmark delta    +1.55% slower   +0.02% flat
Benchmarks >1% slower          42              15
Benchmarks >1% faster          9               15
RSS geomean                    -3.66% lower    -3.01% lower
RSS median benchmark delta     +0.09% flat     +0.09% flat

Overall: Swiss is a small interpreter-mode slowdown, essentially neutral to slightly faster under YJIT, and generally lower in RSS. The RSS median is basically unchanged, so the memory improvement comes from a subset of large reductions rather than broad small wins.

Largest time slowdowns:

  • yjit/loops-times: +12.52%
  • interp/structaset: +9.38%
  • interp/fib: +8.04%
  • interp/keyword_args: +6.76%
  • interp/send_rubyfunc_block: +6.15%

Largest time speedups:

  • yjit/activerecord: -13.38%
  • yjit/fluentd: -11.73%
  • yjit/send_cfunc_block: -8.96%
  • yjit/nqueens: -6.82%
  • yjit/splay: -4.47%


Memory is mixed. A few benches show RSS increases, including interp/graphql-native +20.15%, yjit/fluentd +19.32%, and yjit/graphql-native +17.42%. The biggest RSS reductions are yjit/str_concat -42.45%, interp/psych-load -39.31%, and yjit/psych-load -38.07%; reductions are more common than increases.

Updated by dsh0416 (Delton Ding) 9 days ago Actions #13

  • Description updated (diff)

Updated by dsh0416 (Delton Ding) 9 days ago Actions #14

  • File deleted (changes.patch)

Updated by dsh0416 (Delton Ding) 9 days ago Actions #16 [ruby-core:125351]

The bench results show some very positive signals on some benches, but they also introduce some regressions. I will try to narrow the regressions down further.

Updated by dsh0416 (Delton Ding) 9 days ago Actions #19 [ruby-core:125357]


A Student's t-test shows that the performance is almost identical, while the memory consumption is confidently lower by about 3%.
