What happened + What you expected to happen
When a job submission HTTP request is cancelled (e.g., client timeout, connection reset) after the job has been written to GCS as PENDING but before the supervisor actor is created, the job remains in PENDING status forever. No supervisor actor is spawned, and the job is never cleaned up — it becomes an orphan.
Expected: The job should transition to FAILED with a clear error message.
How to reproduce
This bug requires a specific timing sequence:
- Client submits a job via POST /api/jobs/
- The dashboard agent handler writes the job to GCS as PENDING (via put_info(overwrite=False))
- Before the supervisor actor is created, asyncio.CancelledError is raised — this happens when aiohttp cancels the handler coroutine because the HTTP client disconnected
- The CancelledError propagates through submit_job uncaught, leaving the job permanently PENDING
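The failure mode can be demonstrated outside Ray with a minimal, self-contained asyncio sketch (toy names, not Ray's actual code): a task writes a PENDING marker, then awaits; cancellation at that await point raises asyncio.CancelledError, which an `except Exception` handler does not catch, so the marker is never updated.

```python
import asyncio

async def submit_job_sim(state: dict):
    """Toy stand-in for submit_job: write PENDING, then await before 'spawning'."""
    state["status"] = "PENDING"          # analogous to put_info(overwrite=False)
    try:
        await asyncio.sleep(10)          # analogous to await _get_scheduling_strategy(...)
        state["status"] = "RUNNING"      # reached only if the supervisor starts
    except Exception:
        state["status"] = "FAILED"       # the existing handler: misses CancelledError

async def main():
    state = {}
    task = asyncio.create_task(submit_job_sim(state))
    await asyncio.sleep(0.1)             # let the task reach its await point
    task.cancel()                        # analogous to aiohttp cancelling the handler
    try:
        await task
    except asyncio.CancelledError:
        pass
    return state["status"]

print(asyncio.run(main()))  # PENDING — the "job" is orphaned
```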
Versions / Dependencies
Ray 2.47
Python 3.9
Reproduction script
In job_manager.py submit_job(), after put_info(submission_id, job_info, overwrite=False) writes PENDING to GCS, there are multiple await points before the supervisor actor is created:
```
put_info(PENDING)                        ← job exists in GCS
        ↓
await _get_scheduling_strategy(...)      ← CancelledError can strike here
        ↓
supervisor_actor.options(...).remote(...)
        ↓
supervisor.run.remote(...)
        ↓
run_background_task(_monitor_job(...))   ← only now is the job "alive"
```
The existing error handler is:
```python
except Exception as e:
    await self._job_info_client.put_status(submission_id, JobStatus.FAILED, ...)
```
Since Python 3.8, asyncio.CancelledError inherits from BaseException, not Exception, so this handler does not catch it. The CancelledError propagates out of submit_job, leaving an orphan PENDING entry in GCS.
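One possible shape of a fix (a sketch with toy names, not Ray's actual patch): catch asyncio.CancelledError explicitly in addition to Exception, mark the job FAILED, then re-raise so asyncio's cancellation semantics are preserved.

```python
import asyncio

class JobStore:
    """Minimal stand-in for the GCS-backed job info client (hypothetical)."""
    def __init__(self):
        self.status = {}
    async def put_status(self, job_id, status):
        self.status[job_id] = status

async def submit_job(store, job_id):
    store.status[job_id] = "PENDING"      # analogous to put_info(overwrite=False)
    try:
        await asyncio.sleep(10)           # window where cancellation can strike
        store.status[job_id] = "RUNNING"
    except asyncio.CancelledError:
        # Handle BaseException-derived cancellation explicitly: mark FAILED so
        # the job cannot stay PENDING forever, then re-raise to stay cancelled.
        await store.put_status(job_id, "FAILED")
        raise
    except Exception:
        await store.put_status(job_id, "FAILED")

async def main():
    store = JobStore()
    task = asyncio.create_task(submit_job(store, "job-1"))
    await asyncio.sleep(0.1)              # let the task reach its await point
    task.cancel()                         # simulate the client disconnecting
    try:
        await task
    except asyncio.CancelledError:
        pass
    return store.status["job-1"]

print(asyncio.run(main()))  # FAILED — no orphaned PENDING entry
```

The re-raise matters: swallowing CancelledError would make the coroutine appear to complete normally after being cancelled, which can confuse the caller and the event loop.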
Issue Severity
Medium: It is a significant difficulty but I can work around it.