ggml-cuda : add defensive VRAM margin to prevent swapping stalls#22580

Closed
demiquartz wants to merge 1 commit into ggml-org:master from demiquartz:ggml-cuda-vram-margin-wsl2

Conversation


@demiquartz demiquartz commented May 1, 2026

Overview

This PR adds a defensive VRAM margin in ggml_cuda_device_malloc to prevent system instability and computation stalls caused by VRAM overcommit, specifically in WSL2 environments with ROCm.

  • 256MB Margin: Subtracted from reported free VRAM to prevent the OS from initiating aggressive RAM swapping.
  • 1GB Grace Interval: Implemented for free_caches to minimize the performance overhead of calling cudaMemGetInfo.

Additional information

In environments like WSL2/ROCm (tested with 7900 XTX), exceeding physical VRAM triggers a "swapping loop" where active inference memory is moved between VRAM and system RAM. This saturates the PCIe bus, causing the GPU computation to stall indefinitely and leading to rapid system RAM exhaustion, even if the process remains active. This PR ensures llama.cpp reports a memory allocation error before this stall occurs.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES. AI (Gemini) was used in an assistive capacity to verify C++ implementation details, including operator precedence, type safety for size_t calculations, and cross-compiler compatibility. The core logic (Margin and Grace thresholds) and the diagnosis of the swapping-related stall were conceptualized and validated by the human contributor through practical troubleshooting.

Closes #22583

* Add a 256MB defensive margin to ggml_cuda_device_malloc to prevent VRAM overcommit
* Implement a 1GB grace interval for free_caches to reduce cudaMemGetInfo overhead
* Prevents "swapping loops" that lead to system RAM exhaustion and computation stalls, especially on WSL2/ROCm

Assisted-by: AI (Gemini)
@demiquartz demiquartz requested a review from a team as a code owner May 1, 2026 10:13
@github-actions github-actions Bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels May 1, 2026
@am17an am17an closed this May 1, 2026
@am17an
Contributor

am17an commented May 1, 2026

Create an issue with a reproduction first

@demiquartz
Author

I have created the issue with detailed reproduction steps and environmental information as requested: #22583
This issue includes a screenshot of the VRAM/System RAM behavior on WSL2 and the specific setup script used to reproduce the hang. Since the root cause and reproduction are now clarified, I would like to request a re-review of this PR.


Development

Successfully merging this pull request may close these issues.

Eval bug: System RAM exhaustion and crash due to VRAM overcommit on WSL2 (Ubuntu 24.04)

2 participants