Feature #22011
Open hash tables with Swiss-table probing
Description
This change adds a Swiss-table-inspired probing layer to Ruby's core st_table, and shrinks st_table_entry from 24 B to 16 B by moving the stored hash into a parallel array. It is built and enabled by default; --disable-swiss-st reverts to the original st.c. The public ABI of struct st_table and the iteration-order guarantee are preserved.
Motivation
Hashes are everywhere in Ruby — instance-variable tables, ivar shapes, constant tables, JSON/HTTP/AR rows, every params, every to_h. Profiles of Rails-shaped workloads spend a meaningful fraction of CPU inside st_lookup and st_insert. Two pain points stood out in the upstream implementation:
- Probe loops are branch-heavy. Every step of the perturb chain loads a bin, fetches the entry it points to, compares the full st_hash_t (8 B), and only then calls eql?. On a miss that is several dependent loads per probe, with no way to fast-reject groups of slots in parallel.
- st_table_entry is 24 B. The (hash, key, record) triple gets one cache line per ~2.5 entries. Iteration and equality scans burn through L1 quickly, and Ruby programs typically hold a lot of small-to-mid-sized hashes (so per-table overhead matters).
The Swiss-table family of designs (Abseil flat_hash_map, Rust hashbrown) addresses (1) with a 1-byte-per-slot control array that lets a single SIMD/SWAR comparison reject or short-list 8 slots at once. We borrow that idea but keep Ruby's two-array layout (so we don't break ABI or insertion order) and add a third, parallel ctrl[] byte array. We then attack (2) by also extracting the hash field out of st_table_entry into its own parallel uint32_t hashes[] array.
Changes
- Adds ST_USE_SWISS_BINS, enabled under RUBY_EXPORT and disabled otherwise, so parser_st.c keeps the old-compatible layout.
- Removes the stored hash field from st_table_entry in the Swiss path and stores hashes in a side array after entries, using 32-bit stored hashes.
- Introduces packed Swiss bin groups: 8 control bytes plus 8 bin indexes per group.
- Adds Swiss probing via control-byte matching, using 7-bit h2 fingerprints for candidate filtering.
- Packs bins after entries when possible, reducing separate allocation pressure.
- Updates allocation, copy, free, memsize, rebuild, rehash, insert, delete, shift, lookup, foreach, keys, and values paths to go through hash/bin abstraction macros.
- Adjusts rebuild thresholds for Swiss load factor, using roughly 7/8 bin occupancy.
- Adds tombstone-triggered rebuild behavior when many deleted slots accumulate.
Analysis
Earlier preliminary analysis showed large improvements on microbenchmarks after adopting ideas from Swiss tables, but those gains do not fully carry over to the real-world benches.
Swiss-table-style probing can look huge in isolated lookup/insert microbenchmarks because the hot loop is tight, the table stays cache-hot, and control-byte filtering avoids many full entry/key comparisons.
In real Ruby workloads, hash-table operations are mixed with object allocation, method dispatch, GC barriers, string/hash callbacks, comparisons, and branchy VM work. That weakens the "everything is in L1" assumption.
If the old table is kept at < 0.5 load factor, collision chains/probe lengths are already short, so a simpler open-addressing scheme can be very competitive because it has smaller code and fewer moving parts.
The larger practical win of the Swiss-style design is probably memory density: it sustains a much higher load factor, around 7/8, without probe behavior collapsing, and at competitive speed. The memory saved by the higher load factor outweighs the one byte per slot added by ctrl[].
Files
Updated by dsh0416 (Delton Ding) 11 days ago
- Description updated (diff)
Updated by dsh0416 (Delton Ding) 11 days ago
- Description updated (diff)
Updated by dsh0416 (Delton Ding) 11 days ago
- Description updated (diff)
Updated by dsh0416 (Delton Ding) 11 days ago
- File swiss-hash.patch added
- File deleted (patch.diff)
fix some regression bugs
Updated by dsh0416 (Delton Ding) 11 days ago
- File output_001.txt added
- File output_002.txt added
add ruby bench results
Updated by dsh0416 (Delton Ding) 11 days ago
- File swiss-hash.patch added
- File deleted (swiss-hash.patch)
Updated by dsh0416 (Delton Ding) 10 days ago
I found some crash cases that I need to fix first. Also, I am running the full ruby-bench against the master branch on Linux and Windows to see if I've missed anything.
Updated by byroot (Jean Boussier) 10 days ago
Just a side note: enlarging struct st_table by 16 B has some consequences for various object sizes (e.g. Hash goes from size pool 80 to pool 96, and a bunch of other structs embed struct st_table).
I had a vague plan to collocate the bins and entries buffers like I did with set_table (https://github.com/ruby/ruby/commit/85c52079aa35a1d2e063a5b40eebe91701c8cb9e). If we end up with two more buffers, it becomes more important.
Updated by dsh0416 (Delton Ding) 10 days ago
- File deleted (swiss-hash.patch)
Updated by dsh0416 (Delton Ding) 10 days ago
- File changes.patch added
Updated by dsh0416 (Delton Ding) 10 days ago
· Edited
I see that the benchmarks might be affected: the st_numhash/symbol hash path looks like it returns a constant 128 in bits 25..31, which increases hash collisions. Under investigation.
Updated by dsh0416 (Delton Ding) 9 days ago
· Edited
- File memory_top_rss_changes.png added
- File time_by_test_violin_interp.png added
- File time_by_test_violin_yjit.png added
- Description updated (diff)
Here are the ruby-bench results.

| Area | interp | yjit |
|---|---|---|
| Time geomean | +1.77% slower | -0.17% faster |
| Time median benchmark delta | +1.55% slower | +0.02% flat |
| Benchmarks >1% slower | 42 | 15 |
| Benchmarks >1% faster | 9 | 15 |
| RSS geomean | -3.66% lower | -3.01% lower |
| RSS median benchmark delta | +0.09% flat | +0.09% flat |
Overall: Swiss is a small interpreter-mode slowdown, essentially neutral or faster under YJIT, and generally lower RSS. The RSS median is basically unchanged, so the memory improvement comes from a subset of large reductions rather than broad small wins.
Largest time slowdowns:
- yjit/loops-times: +12.52%
- interp/structaset: +9.38%
- interp/fib: +8.04%
- interp/keyword_args: +6.76%
- interp/send_rubyfunc_block: +6.15%
Largest time speedups:
- yjit/activerecord: -13.38%
- yjit/fluentd: -11.73%
- yjit/send_cfunc_block: -8.96%
- yjit/nqueens: -6.82%
- yjit/splay: -4.47%

Memory is mixed. A few benches show RSS increases, including interp/graphql-native +20.15%, yjit/fluentd +19.32%, and yjit/graphql-native +17.42%. The biggest RSS reductions are yjit/str_concat -42.45%, interp/psych-load -39.31%, and yjit/psych-load -38.07%, and reductions are more common than increases.
Updated by dsh0416 (Delton Ding) 9 days ago
- Description updated (diff)
Updated by dsh0416 (Delton Ding) 9 days ago
- File deleted (changes.patch)
Updated by dsh0416 (Delton Ding) 9 days ago
- File changes.patch added
Updated by dsh0416 (Delton Ding) 9 days ago
From the bench results we can see some very positive signals in some benches, but the change also introduces some regressions. I will try to narrow those down further.
Updated by dsh0416 (Delton Ding) 9 days ago
· Edited
- File memory_top_rss_changes.png added
- File time_by_test_violin_interp.png added
- File time_by_test_violin_yjit.png added
- File changes.patch added
The regressions are better controlled with the new patch.



Updated by dsh0416 (Delton Ding) 9 days ago

A Student's t-test shows that the performance is almost identical, while the memory consumption is confidently lowered by about 3%.