ggml-cuda : add defensive VRAM margin to prevent swapping stalls#22580

Closed
demiquartz wants to merge 1 commit into ggml-org:master from demiquartz:ggml-cuda-vram-margin-wsl2

Conversation


@demiquartz demiquartz commented May 1, 2026

Overview

This PR adds a defensive VRAM margin in ggml_cuda_device_malloc to prevent system instability and computation stalls caused by VRAM overcommit, specifically in WSL2 environments with ROCm.

  • 256MB Margin: Subtracted from reported free VRAM to prevent the OS from initiating aggressive RAM swapping.
  • 1GB Grace Interval: Implemented for free_caches to minimize the performance overhead of calling cudaMemGetInfo.

Additional information

In environments like WSL2/ROCm (tested with 7900 XTX), exceeding physical VRAM triggers a "swapping loop" where active inference memory is moved between VRAM and system RAM. This saturates the PCIe bus, causing the GPU computation to stall indefinitely and leading to rapid system RAM exhaustion, even if the process remains active. This PR ensures llama.cpp reports a memory allocation error before this stall occurs.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES. AI (Gemini) was used in an assistive capacity to verify C++ implementation details, including operator precedence, type safety for size_t calculations, and cross-compiler compatibility. The core logic (Margin and Grace thresholds) and the diagnosis of the swapping-related stall were conceptualized and validated by the human contributor through practical troubleshooting.

Closes #22583

* Add a 256MB defensive margin to ggml_cuda_device_malloc to prevent VRAM overcommit
* Implement a 1GB grace interval for free_caches to reduce cudaMemGetInfo overhead
* Prevents "swapping loops" that lead to system RAM exhaustion and computation stalls, especially on WSL2/ROCm

Assisted-by: AI (Gemini)
@demiquartz demiquartz requested a review from a team as a code owner May 1, 2026 10:13
@github-actions github-actions Bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels May 1, 2026
@am17an am17an closed this May 1, 2026
@am17an
Contributor

am17an commented May 1, 2026

Create an issue with a reproduction first

@demiquartz
Author

I have created the issue with detailed reproduction steps and environmental information as requested: #22583
This issue includes a screenshot of the VRAM/System RAM behavior on WSL2 and the specific setup script used to reproduce the hang. Since the root cause and reproduction are now clarified, I would like to request a re-review of this PR.


Development

Successfully merging this pull request may close these issues.

Eval bug: System RAM exhaustion and crash due to VRAM overcommit on WSL2 (Ubuntu 24.04)

2 participants