Handle cluster slot migration in ClusterPubSub (shard subscription reconciliation)#4044
Open
petyaslavova wants to merge 3 commits intomasterfrom
Open
Handle cluster slot migration in ClusterPubSub (shard subscription reconciliation)#4044petyaslavova wants to merge 3 commits intomasterfrom
petyaslavova wants to merge 3 commits intomasterfrom
Conversation
🛡️ Jit Security Scan Results✅ No security findings were detected in this PR
Security scan by Jit
|
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit b18552b. Configure here.
vladvildanov
requested changes
Apr 23, 2026
| if not self.shard_channels: | ||
| return | ||
| try: | ||
| loop = asyncio.get_running_loop() |
Collaborator
There was a problem hiding this comment.
Why is this method is not async? I believe it's possible to convert it to async and do not interact with loop itself
| # Notify observers so shard-pubsub reconciliation can run; skipped on | ||
| # the no-op branch to avoid needless walks under MOVED storms. | ||
| if node_changed: | ||
| self._notify_pubsub_observers() |
Collaborator
There was a problem hiding this comment.
I think we should leverage EventDispatcher for the events handling, for the alignment
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Summary
Adds shard-PubSub reconciliation on cluster topology changes. Before this change, when a slot holding an active
SSUBSCRIBEmigrated to a different node (CLUSTER SETSLOT, failover, or rebalance), the client kept reading from the old owner and silently missed messages — the reconciliation hook simply did not exist. This PR wires an observer pattern betweenNodesManagerandClusterPubSubso that every slots-cache update triggers a reconciliation pass: for each tracked shard channel we re-resolve the owning node and, if it has moved, performSUNSUBSCRIBEon the old per-node pubsub andSSUBSCRIBEon the new one, preserving any registered handler. Implemented with strict sync/async parity inredis/cluster.pyandredis/asyncio/cluster.py.Design choices
_shard_channel_to_node) records the node that actually holds each subscription, soSUNSUBSCRIBEroutes to the holding node even after the cluster's slot map has moved on —cluster.get_node_from_keyby then points to the new owner and would miss the live subscription.SlotNotCoveredErrorper channel, continues with coverable siblings, then raises a single summarySlotNotCoveredErrorat the end so callers know the pass was incomplete. The next slots-cache change triggers a retry.NodesManagerobserver registry is guarded byself._lock; callbacks are invoked outside the lock so observers can safely call back into the NodesManager. Async relies on the event loop's single-threaded guarantee.move_slotare treated as a no-op and skip observer notification, avoiding reconciliation storms.Resource management
Two connection-release points ensure per-node pubsubs do not leak their dedicated connections when dropped from
node_pubsub_mapping(PubSub.__del__does not release connections back to the pool): (1) the migration-drivenSUNSUBSCRIBEconfirmation branch inget_sharded_messagecallsaclose()(async) /reset()(sync) on the per-node pubsub before the mapping drop; (2) the GC block at the end ofreinitialize_shard_subscriptionsdoes the same for any per-node pubsub emptied by the reconciliation pass. On the async side,_on_slots_changedschedules reconciliation as a fire-and-forget task with a done-callback (_log_reconcile_task_exception) that consumes the task's exception and routes it throughlogger.error, preventing "Task exception was never retrieved" warnings and giving the same observability as the sync path's_notify_pubsub_observerstry/except.Tests
New deterministic (mock-based) test classes
TestClusterPubSubSlotMigrationon both sync and async sides cover: channel moves to new owner, no-op when owner unchanged, tolerance to old-node disconnect,SUNSUBSCRIBErouting via the reverse index after migration, handler-override semantics on lazy reroute,acloseteardown (including cancellation of in-flight reconcile tasks, async), per-node pubsub drop + connection release on migration-drivenSUNSUBSCRIBE, partial-progress semantics when one slot is uncovered, and (async) consumption+logging of the reconcile task's exception.NodesManagerobserver register/notify/failover/circular-redirect paths have four new mock-only tests each side.All new tests are marked
@pytest.mark.fixed_clientand run outside a live cluster; class-level@pytest.mark.onlyclusteronTestNodesManagerwas refactored to test-level so the existing live-cluster coverage is unchanged.Note
Medium Risk
Changes cluster PubSub behavior during slot migration/failover by automatically moving shard subscriptions between nodes and cleaning up per-node pubsub connections; mistakes could cause missed messages or connection churn under topology flaps.
Overview
Adds a slots-cache observer mechanism in
NodesManager(sync + asyncio) and wiresClusterPubSubinto it so slot-map refreshes and realMOVED-driven slot changes trigger shard-subscription reconciliation.ClusterPubSubnow tracks a reverse index (_shard_channel_to_node) to routeSUNSUBSCRIBEto the node that actually holds the subscription, migrates shard channels to the current slot owner (preserving handlers, handling transientSlotNotCoveredError), and aggressively closes/drops empty per-node pubsubs to avoid leaked dedicated connections. Async reconciliation runs as a background task with exception consumption/logging, andaclose()now cancels in-flight reconcile tasks and clears the reverse index.Adds deterministic unit tests for observer registration/notification and shard-subscription migration behavior (sync + async), and adjusts some cluster tests’ markers to run under the fixed mocked client.
Reviewed by Cursor Bugbot for commit 160544f. Bugbot is set up for automated code reviews on this repo. Configure here.