close
Skip to content

fix: use leader election in MergeAgain to unblock parallel plans when PR is behind base#6470

Open
josepmedialdea wants to merge 4 commits into
runatlantis:mainfrom
josepmedialdea:fix/mergeagain-parallel-leader-election
Open

fix: use leader election in MergeAgain to unblock parallel plans when PR is behind base#6470
josepmedialdea wants to merge 4 commits into
runatlantis:mainfrom
josepmedialdea:fix/mergeagain-parallel-leader-election

Conversation

@josepmedialdea
Copy link
Copy Markdown

@josepmedialdea josepmedialdea commented May 12, 2026

Summary

Problem

PR #6376 introduced a read-lock fast path in Clone() so parallel goroutines skip the write lock when the repo is already at the correct commit. That fixed the common case.

However, when the PR is behind its base branch, all goroutines pass through MergeAgain() and independently try to acquire the write lock after detecting divergence. Because runSteps() holds a shared read lock for the entire plan duration (potentially minutes), every write-lock attempt serializes behind the preceding goroutine's plan:

G0: recheckDiverged [read lock] → write lock → merge → runSteps [READ, 5 min]
G1: recheckDiverged [read lock] → write lock → BLOCKED by G0's read lock
G2: recheckDiverged [read lock] → write lock → BLOCKED waiting for G1…

Every goroutine waits for the preceding goroutine's entire plan to complete before it can even discover whether a merge is still needed.

Fix

Elect exactly one goroutine as the merge leader using sync.Map.LoadOrStore. The leader acquires the write lock and merges as before. All other goroutines receive the result through a chan struct{} that is completely independent of the RWMutex — no write-lock contention at all:

G0: recheckDiverged [read lock] → leader → write lock → merge → runSteps [read]
G1: recheckDiverged [read lock] → follower → <-done (channel) → runSteps [read]
G2: recheckDiverged [read lock] → follower → <-done (channel) → runSteps [read]

All goroutines proceed to runSteps in parallel as soon as the single merge finishes.

Defer ordering: the channel is closed and the map entry deleted while the write lock is still held (LIFO defer order), so no new goroutine can observe and join a completed-but-not-yet-deleted entry after the lock is released.

Testing

  • New regression test TestMergeAgain_ConcurrentDiverged: creates a merge-checkout workspace, advances the base branch to force divergence, launches 5 concurrent MergeAgain goroutines, and asserts no errors and that the workspace on disk contains the base-branch update
  • Validated in production against a real Atlantis instance with multiple parallel projects where the PR was behind the base branch. Plans now run concurrently instead of sequentially

Related

Notes

  • Developed with assistance from Claude Opus 4.6 via Claude Code.
  • The fix was validated in production before submission.

Copilot AI review requested due to automatic review settings May 12, 2026 15:52
@dosubot dosubot Bot added the go Pull requests that update Go code label May 12, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

…hind base

Signed-off-by: Josep Medialdea <josep@medialdea.io>
…ncurrent test

Signed-off-by: Josep Medialdea <josep@medialdea.io>
@josepmedialdea josepmedialdea force-pushed the fix/mergeagain-parallel-leader-election branch from 0cdbbd8 to eef48e1 Compare May 13, 2026 13:19
Signed-off-by: Josep Medialdea <josep@medialdea.io>
@pseudomorph
Copy link
Copy Markdown
Contributor

Is this perhaps an issue with recheckRequiredMap? I think there might be some overlap in intent between the new pendingMerge and recheckRequiredMap. I want to make sure we're not introducing another lock if we can replace or improve an existing one.

@josepmedialdea
Copy link
Copy Markdown
Author

Is this perhaps an issue with recheckRequiredMap? I think there might be some overlap in intent between the new pendingMerge and recheckRequiredMap. I want to make sure we're not introducing another lock if we can replace or improve an existing one.

@pseudomorph recheckRequiredMap was doing the “someone else already handled this” check back when MergeAgain held the exclusive lock before checking divergence. But after #6376, the divergence check happens before the write lock, so that map only helps once a goroutine is already on the write-lock path. That’s too late for this issue.

pendingMerges fixes that by coordinating earlier, but now there are two bits of state trying to do similar things.

I can rework this so there’s just one in-flight MergeAgain per clone dir: one goroutine checks divergence and merges if needed, and the others wait for that result. That would remove recheckRequiredMap. Followers would return merged=true if the leader merged, matching the current behaviour.

Before I do that:

  1. Is that what you had in mind?
  2. Are you OK with only one goroutine running recheckDiverged?
  3. Would you rather have that cleanup here, or in a follow-up?

@pseudomorph
Copy link
Copy Markdown
Contributor

I think we can just drop recheckRequiredMap without any further modification. Your changes should cover it's intent as-is.

@josepmedialdea
Copy link
Copy Markdown
Author

Agree, that's cleaner. Will push an update

Signed-off-by: Josep Medialdea <josep@medialdea.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go Pull requests that update Go code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants