[core] Job stuck in PENDING forever when HTTP request is cancelled during submission #62766

@Myasuka

Description

What happened + What you expected to happen

When a job submission HTTP request is cancelled (e.g., client timeout, connection reset) after the job has been written to GCS as PENDING but before the supervisor actor is created, the job remains in PENDING status forever. No supervisor actor is spawned, and the job is never cleaned up — it becomes an orphan.

Expected: The job should transition to FAILED with a clear error message.

How to reproduce
This bug requires a specific timing sequence:

  1. Client submits a job via POST /api/jobs/
  2. The dashboard agent handler writes the job to GCS as PENDING (via put_info(overwrite=False))
  3. Before the supervisor actor is created, asyncio.CancelledError is raised — this happens when aiohttp cancels the handler coroutine because the HTTP client disconnected
  4. The CancelledError propagates through submit_job uncaught, leaving the job permanently PENDING
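The timing sequence above can be imitated without a Ray cluster. The following is a minimal sketch, not Ray's code: `job_table` is a hypothetical in-memory stand-in for the GCS job table, `submit_job` here is a simplified imitation of the real method, and `asyncio.sleep(1)` stands in for the await points that precede supervisor creation.

```python
import asyncio

# Hypothetical in-memory stand-in for the GCS job table (not a Ray API).
job_table = {}


async def submit_job(submission_id: str) -> None:
    # Step 2: the job is recorded as PENDING before anything else happens.
    job_table[submission_id] = "PENDING"
    try:
        # Stand-in for the awaits that run before the supervisor actor exists.
        await asyncio.sleep(1)
        job_table[submission_id] = "RUNNING"
    except Exception:
        # Same shape as the existing handler; it is not reached on cancellation.
        job_table[submission_id] = "FAILED"


async def main() -> str:
    task = asyncio.create_task(submit_job("job-1"))
    await asyncio.sleep(0)  # let submit_job run up to its first suspension point
    task.cancel()           # step 3: imitate aiohttp cancelling the handler
    try:
        await task
    except asyncio.CancelledError:
        pass                # step 4: the cancellation escaped submit_job
    return job_table["job-1"]


final_status = asyncio.run(main())
print(final_status)  # PENDING
```

In this sketch the entry stays PENDING after the task is cancelled, which matches the orphaned state described in this report.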

Versions / Dependencies

Ray 2.47
Python 3.9

Reproduction script

In job_manager.py submit_job(), after put_info(submission_id, job_info, overwrite=False) writes PENDING to GCS, there are multiple await points before the supervisor actor is created:

put_info(PENDING)          ← job exists in GCS
    ↓
await _get_scheduling_strategy(...)    ← CancelledError can strike here
    ↓
supervisor_actor.options(...).remote(...)
    ↓
supervisor.run.remote(...)
    ↓
run_background_task(_monitor_job(...))  ← only now is the job "alive"

The existing error handler is:

except Exception as e:
    await self._job_info_client.put_status(submission_id, JobStatus.FAILED, ...)

Since Python 3.8, asyncio.CancelledError inherits from BaseException, not Exception, so this handler does not catch it. The CancelledError propagates out of submit_job, leaving an orphaned PENDING entry in GCS.
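One possible direction for a fix, sketched under assumptions rather than taken from Ray's source: catch `asyncio.CancelledError` explicitly, mark the job FAILED, then re-raise so the cancellation still reaches the caller. `FakeJobInfoClient`, its `put_status` signature, the string statuses, and the error message are all hypothetical stand-ins for illustration.

```python
import asyncio


class FakeJobInfoClient:
    """Hypothetical in-memory stand-in for the job info client."""

    def __init__(self):
        self.status = {}

    async def put_status(self, submission_id, status, message=""):
        self.status[submission_id] = (status, message)


async def submit_job(client, submission_id):
    await client.put_status(submission_id, "PENDING")
    try:
        # Stand-in for the awaits that run before the supervisor actor exists.
        await asyncio.sleep(1)
        await client.put_status(submission_id, "RUNNING")
    except asyncio.CancelledError:
        # Not covered by `except Exception`, so it needs its own clause.
        await client.put_status(
            submission_id, "FAILED", "Job submission was cancelled."
        )
        raise  # keep propagating the cancellation to the caller
    except Exception as e:
        await client.put_status(submission_id, "FAILED", str(e))


async def main():
    client = FakeJobInfoClient()
    task = asyncio.create_task(submit_job(client, "job-1"))
    await asyncio.sleep(0)  # let submit_job reach its first suspension point
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return client.status["job-1"]


# False on Python 3.8 and later, which is why the original handler misses it.
print(issubclass(asyncio.CancelledError, Exception))

final = asyncio.run(main())
print(final[0])  # FAILED
```

A real fix would probably also need to make sure the status write itself completes if the task is cancelled again while writing; this sketch does not handle that case.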

Issue Severity

Medium: It is a significant difficulty but I can work around it.

Labels: bug, community-backlog, core, stability, triage
