
Fix pathological performance in trait solver cycles with errors #155355

Open
erickt wants to merge 1 commit into rust-lang:main from erickt:trait-solver-hang

Conversation


@erickt (Contributor) commented Apr 15, 2026

Fuchsia's Starnix system has had a multi-year long bug where occasionally a typo could cause the Rust compiler to take 10+ hours to report an error (see #136516 and #150907). This was particularly hard to track down since Starnix's codebase is massive, over 384 thousand lines as of this writing.

With the help of treereduce, cargo-minimize, and rustmerge, after about a month of running, we reduced it to a couple of lines of code, which takes only about 35 seconds to report an error on my machine. The bug also appears to happen with `-Z next-solver=no` and `-Z next-solver=coherence`, but does not occur with `-Z next-solver` or `-Z next-solver=globally`.

I used Gemini to help diagnose the problem and propose a solution (the one implemented in this patch):

  1. The trait solver gets stuck in an exponential loop evaluating auto-trait bounds (like `Send` and `Sync`) on cyclic types that contain compilation errors (`TyKind::Error`).

  2. Normally, if the solver detects a cycle, it prevents the result from being stored in the Global Cache because the result depends on the current evaluation stack. However, when an error is involved, the depth tracking gets pinned to a low value, forcing the solver to rely on the short-lived Provisional Cache. Since the provisional cache is cleared between high-level iterations of the fulfillment loop, the solver ends up re-discovering and re-evaluating the same large cycle thousands of times.

  3. The fix: allow global caching of results even if they appear stack-dependent, provided the inference context is already "tainted by errors" (`self.infcx.tainted_by_errors().is_some()`). This violates the strict invariant that global cache entries shouldn't depend on the stack, but it is safe because the compilation is already guaranteed to fail due to the presence of errors. Prioritizing compiler responsiveness and termination over perfect correctness in error states is the correct trade-off here.
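
A toy model may make the blow-up in point 2 concrete. This sketch uses made-up names (`Solver`, `evaluate`, a cycle of numbered goals) rather than rustc's actual data structures, and it only models the linear cost of rediscovering one cycle per fulfillment iteration, not the full exponential behavior:

```rust
use std::collections::HashMap;

// Toy model (an assumption-laden sketch, not rustc code). Goals 0..n form
// a cycle: goal i depends on goal (i + 1) % n. Results discovered while
// the cycle is on the stack normally go only into a provisional cache
// that is cleared between fulfillment iterations; the proposed fix lets
// them into the long-lived global cache instead.
struct Solver {
    n: u32,
    global: HashMap<u32, bool>,
    evaluations: usize,
    cache_globally: bool,
}

impl Solver {
    fn new(n: u32, cache_globally: bool) -> Self {
        Solver { n, global: HashMap::new(), evaluations: 0, cache_globally }
    }

    fn evaluate(
        &mut self,
        goal: u32,
        stack: &mut Vec<u32>,
        provisional: &mut HashMap<u32, bool>,
    ) -> bool {
        if let Some(&r) = self.global.get(&goal) {
            return r;
        }
        if let Some(&r) = provisional.get(&goal) {
            return r;
        }
        if stack.contains(&goal) {
            return true; // cycle detected: provisionally assume the goal holds
        }
        stack.push(goal);
        self.evaluations += 1;
        let dep = (goal + 1) % self.n;
        let result = self.evaluate(dep, stack, provisional);
        stack.pop();
        if self.cache_globally {
            self.global.insert(goal, result); // the relaxation in this patch
        } else {
            provisional.insert(goal, result); // discarded after this iteration
        }
        result
    }

    // Each "fulfillment loop" iteration re-checks every goal with a fresh
    // provisional cache, as described in point 2.
    fn run(&mut self, iterations: usize) -> usize {
        for _ in 0..iterations {
            let mut provisional = HashMap::new();
            for g in 0..self.n {
                self.evaluate(g, &mut Vec::new(), &mut provisional);
            }
        }
        self.evaluations
    }
}

fn main() {
    // 100 goals in a cycle, 1000 fulfillment iterations:
    println!("provisional only: {} evaluations", Solver::new(100, false).run(1000));
    println!("global caching:   {} evaluations", Solver::new(100, true).run(1000));
}
```

With 100 goals and 1000 iterations, the provisional-only variant performs 100,000 evaluations while the globally-caching variant performs 100; the real solver is far worse because overlapping cycles multiply the work within each iteration.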

I added the reduction as the test case for this. However, I don't see an easy way to catch this bug if it comes back. Should we add some way to time out the test if it takes longer than 10 seconds to compile? That could be a source of flakes, though.

I don't have any experience with the trait solver code, but I did try to review the code to the best of my ability. This approach seems like a bit of a band-aid, but I don't see a better solution. We could try to teach the solver not to clear the provisional cache in this circumstance, but I suspect that'd be a pretty invasive change.

I'm guessing if this does cause problems, it might report an incorrect error, but I (and Gemini) were unable to come up with an example that reported a different error with and without this fix.

Resolves #150907

@rustbot added the S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties.) and T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue.) labels on Apr 15, 2026

@rustbot (Collaborator) commented Apr 15, 2026

r? @dingxiangfei2009

rustbot has assigned @dingxiangfei2009.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

Why was this reviewer chosen?

The reviewer was selected based on:

  • Owners of files modified in this PR: compiler
  • compiler expanded to 72 candidates
  • Random selection from 17 candidates


@erickt force-pushed the trait-solver-hang branch from de41321 to cca98a2 on April 15, 2026 19:24

@erickt (Contributor, Author) commented Apr 15, 2026

Thanks rustbot, I removed the issue links from the commit message.


@BoxyUwU (Member) commented Apr 15, 2026

r? lcnr

@rustbot assigned lcnr and unassigned dingxiangfei2009 on Apr 15, 2026

```diff
 let reached_depth = stack.reached_depth.get();
-if reached_depth >= stack.depth {
+if reached_depth >= stack.depth || self.infcx.tainted_by_errors().is_some() {
```
@lcnr (Contributor) commented Apr 16, 2026

this seems like a very general fix, does limiting this to goals which contain an error type themselves also fix the hang?

I agree that this change won't be unsound and at worst causes confusing errors. I am slightly worried, as it does break an - already broken - invariant, which can be a pain when trying to add assertions later on or if there's weird behavior we don't quite get. Given that it's limited to the old solver, which will be removed this year, that is fine.

Please add a comment explaining why we're doing this, and maybe even make `let Some(_guar) = self.infcx.tainted_by_errors()` into a separate branch

cc @rust-lang/types I don't think this needs an FCP, as reverting it won't be a breaking change and it seems trivial enough; still want to make sure nobody is strongly opposed here :3
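
A self-contained analog of the separate-branch structure suggested here, with hypothetical names (`cache_decision`, `CacheDecision`) standing in for the solver's internals - a sketch of the shape, not rustc code:

```rust
// Hypothetical stand-in for the caching decision in the old solver.
// `reached_depth` / `depth` mirror the diff above; nothing else is real.
#[derive(Debug, PartialEq)]
enum CacheDecision {
    Global,        // result is stack-independent: safe to cache globally
    GlobalTainted, // stack-dependent, but compilation already failed anyway
    Provisional,   // stack-dependent: keep out of the global cache
}

fn cache_decision(reached_depth: usize, depth: usize, tainted_by_errors: bool) -> CacheDecision {
    if reached_depth >= depth {
        CacheDecision::Global
    } else if tainted_by_errors {
        // A separate branch makes the "only because of errors" escape hatch
        // explicit: easy to comment, and easy to find or revert later.
        CacheDecision::GlobalTainted
    } else {
        CacheDecision::Provisional
    }
}

fn main() {
    assert_eq!(cache_decision(3, 2, false), CacheDecision::Global);
    assert_eq!(cache_decision(1, 2, true), CacheDecision::GlobalTainted);
    assert_eq!(cache_decision(1, 2, false), CacheDecision::Provisional);
}
```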


@erickt (Contributor, Author) replied:

Great idea, I pulled it out into a separate branch with a comment in the latest revision.

> this seems like a very general fix, does limiting this to goals which contain an error type themselves also fix the hang?

Would that be by switching from `self.infcx.tainted_by_errors()` to `fresh_trait_pred.references_error()`? If so, Gemini and I did try that out, and it didn't seem to fix it. The test I added regressed from ~0.06s to 35s. Here's the argument the AI made (it makes sense to me, but I'm not an expert here, so take it with a grain of salt):


This approach was ineffective and the compiler still hung (taking over 70 seconds in the minimal reproduction case, and likely hours in the full codebase).

The reason is that in complex cyclic types (like those in Starnix):

  1. Indirect Errors: The error might not be directly in the trait predicate we are currently evaluating. For example, we might be evaluating Node: Send, and Node itself looks perfectly valid. However, Node might contain a field that eventually resolves to an error, or it might be connected through a long chain of other types to an error.

  2. Inference Context: The error might be recorded in the inference context (e.g., an inference variable failed to unify) but hasn't propagated to the specific `fresh_trait_pred` we are looking at yet.

Because `references_error()` only looks at the immediate structure of the predicate, it missed these indirect cases. The solver would still see the result as "stack-dependent" (because of the cycle) and refuse to cache it globally, leading to the same exponential re-evaluation loop.

Why `tainted_by_errors()` is necessary

Switching to `self.infcx.tainted_by_errors().is_some()` solved the problem because it acts as a global flag for the current inference session.

If any error has occurred anywhere in this inference context, we know the compilation is doomed. By allowing global caching for all completed evaluations in this state, we successfully break the exponential cycles, regardless of whether the error is direct or indirect.

This is a pragmatic trade-off: we accept that we might cache slightly incorrect results or suppress secondary errors in order to guarantee that the compiler actually terminates and reports the first error.
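
The direct vs. indirect distinction above can be illustrated with a toy model; `Ty`, `Pred`, and `InferCtxt` here are made-up stand-ins, not rustc's types:

```rust
// Made-up miniature types illustrating shallow vs. session-wide error checks.
enum Ty {
    Named(&'static str),
    Error,
}

struct Pred {
    self_ty: Ty,
}

impl Pred {
    // Shallow check, analogous to `references_error()` on one predicate:
    fn references_error(&self) -> bool {
        matches!(self.self_ty, Ty::Error)
    }
}

struct InferCtxt {
    // Session-wide flag, analogous to `tainted_by_errors()`:
    tainted_by_errors: bool,
}

fn main() {
    // A predicate that directly mentions the error is caught by the
    // shallow check:
    let direct = Pred { self_ty: Ty::Error };
    assert!(direct.references_error());

    // But for `Node: Send` where `Node` itself looks fine, only the
    // session-wide flag notices that an error was recorded elsewhere:
    let indirect = Pred { self_ty: Ty::Named("Node") };
    let infcx = InferCtxt { tainted_by_errors: true };
    assert!(!indirect.references_error()); // shallow check misses it
    assert!(infcx.tainted_by_errors);      // session-wide flag catches it
}
```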

Fuchsia's Starnix system has had a multi-year long bug where
occasionally a typo could cause the Rust compiler to take 10+ hours to
report an error. This was particularly hard to track down since
Starnix's codebase is massive, over 384 thousand lines as of this
writing.

With the help of treereduce, cargo-minimize, and rustmerge, after about
a month of running, we reduced it to a couple of [lines of code], which
takes only about 35 seconds to report an error on my machine. The bug
also appears to happen with `-Z next-solver=no` and `-Z
next-solver=coherence`, but does not occur with `-Z next-solver` or `-Z
next-solver=globally`.

I used Gemini to help diagnose the problem and propose a solution (the
one implemented in this patch):

1. The trait solver gets stuck in an exponential loop evaluating
   auto-trait bounds (like Send and Sync) on cyclic types that contain
   compilation errors (TyKind::Error).

2. Normally, if the solver detects a cycle, it prevents the result from
   being stored in the Global Cache because the result depends on the
   current evaluation stack. However, when an error is involved, the
   depth tracking gets pinned to a low value, forcing the solver to rely
   on the short-lived Provisional Cache. Since the provisional cache is
   cleared between high-level iterations of the fulfillment loop, the
   solver ends up re-discovering and re-evaluating the same large cycle
   thousands of times.

3. The fix: allow global caching of results even if they appear
   stack-dependent, provided the inference context is already "tainted
   by errors"
   (`self.infcx.tainted_by_errors().is_some()`). This violates the
   strict invariant that global cache entries shouldn't depend on the
   stack, but it is safe because the compilation is already guaranteed
   to fail due to the presence of errors. Prioritizing compiler
   responsiveness and termination over perfect correctness in error
   states is the correct trade-off here.

I added the reduction as the test case for this. However, I don't see an
easy way to catch if this bug comes back. Should we add some way to
time out the test if it takes longer than 10 seconds to compile? That
could be a source of flakes, though.

I don't have any experience with the trait solver code, but I did try to
review the code to the best of my ability. This approach seems like a
bit of a band-aid, but I don't see a better solution. We could
try to teach the solver to not clear the provisional cache in this
circumstance, but I suspect that'd be a pretty invasive change.

I'm guessing if this does cause problems, it might report an incorrect
error, but I (and Gemini) were unable to come up with an example that
reported a different error with and without this fix.

[lines of code]: https://gist.github.com/erickt/255bc4006292cac88de906bd6bd9220a
@erickt force-pushed the trait-solver-hang branch from cca98a2 to 0f19b6d on April 17, 2026 21:29
@erickt requested a review from lcnr on April 17, 2026 21:36

Development

Successfully merging this pull request may close these issues.

Long delay printing errors with metadata
