
fix(onnx): use HEURISTIC cudnn_conv_algo_search for ORT GPU session #17970

Open
scyyh11 wants to merge 3 commits into PaddlePaddle:main from scyyh11:fix/onnx-cudnn-conv-algo-heuristic

Conversation

@scyyh11 scyyh11 commented Apr 24, 2026

Summary

tools/infer/utility.py builds an ONNX Runtime CUDA EP session with cudnn_conv_algo_search="DEFAULT". DEFAULT pins cuDNN to CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM. cuDNN rejects this algo for many of PP-OCRv5 server rec's conv configs, so ORT falls back to a non-cuDNN slow path (visible at runtime as "OP Conv(...) running in Fallback mode. May be extremely slow." warnings). Steady-state per-shape latency ends up ~50× higher than necessary.

This PR switches the option to HEURISTIC, which asks cuDNN to pick a supported algo per node via cudnnGetConvolutionForwardAlgorithm_v7. Same fix pattern as PaddleX#5057, but for the Python --use_onnx --use_gpu path here (PaddleX#5057 only patches ultra-infer's C++ ORT backend).
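For reference, the change follows ONNX Runtime's standard CUDA EP provider-options pattern. A minimal sketch (the model path is illustrative, not the exact tools/infer/utility.py code):

```python
import onnxruntime as ort

# "HEURISTIC" asks cuDNN to pick a supported conv algo per node via
# cudnnGetConvolutionForwardAlgorithm_v7; "DEFAULT" pins one algo and
# triggers the fallback described above when cuDNN rejects it.
providers = [
    ("CUDAExecutionProvider", {"cudnn_conv_algo_search": "HEURISTIC"}),
    "CPUExecutionProvider",
]
sess = ort.InferenceSession("rec_model.onnx", providers=providers)
```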

Closes/refs #17959.

Benchmark

PP-OCRv5 server rec ONNX, RTX 2080 Ti, onnxruntime-gpu==1.20.0, CUDA 12.6 / cuDNN 9. Cold = 16 unique shapes [B,3,48,W] with B∈{1,2,4,8}, W∈{90,160,320,480}, each timed on its first run. Steady = 5 warmup + 100 timed runs at a fixed shape.

| mode | cold total | cold avg | steady (1,3,48,160) | steady (4,3,48,320) | steady (8,3,48,480) |
| --- | --- | --- | --- | --- | --- |
| DEFAULT | 4556 ms | 284.7 ms | 206.05 ms | 209.31 ms | 220.40 ms |
| HEURISTIC | 1488 ms | 93.0 ms | 4.27 ms | 9.54 ms | 24.96 ms |
| EXHAUSTIVE | 1424 ms | 89.0 ms | 4.17 ms | 9.73 ms | 24.84 ms |

HEURISTIC matches EXHAUSTIVE's steady-state performance without the per-new-shape kernel-search cost that EXHAUSTIVE incurs on dynamic-shape OCR workloads (the original symptom in #17959). A reproduction sketch follows.
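A minimal sketch of the timing harness described above (the model filename and random inputs are placeholders; the shape grid matches the benchmark setup):

```python
import time
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession(
    "ppocrv5_server_rec.onnx",  # placeholder path for the exported rec model
    providers=[("CUDAExecutionProvider", {"cudnn_conv_algo_search": "HEURISTIC"})],
)
inp = sess.get_inputs()[0].name

# Cold: first run of each of the 16 unique shapes triggers per-shape algo selection.
cold = 0.0
for b in (1, 2, 4, 8):
    for w in (90, 160, 320, 480):
        x = np.random.rand(b, 3, 48, w).astype(np.float32)
        t0 = time.perf_counter()
        sess.run(None, {inp: x})
        cold += time.perf_counter() - t0
print(f"cold total: {cold * 1000:.0f} ms")

# Steady: 5 warmup + 100 timed runs at a fixed shape.
x = np.random.rand(1, 3, 48, 160).astype(np.float32)
for _ in range(5):
    sess.run(None, {inp: x})
t0 = time.perf_counter()
for _ in range(100):
    sess.run(None, {inp: x})
print(f"steady avg: {(time.perf_counter() - t0) / 100 * 1000:.2f} ms")
```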

Test plan

- `python -c "import ast; ast.parse(open('tools/infer/utility.py').read())"` — the file still parses
- One-line option change; no API or default-behavior change for non-ONNX users
- CI green


paddle-bot Bot commented Apr 24, 2026

Thanks for your contribution!

@scyyh11 scyyh11 requested a review from Bobholamovic April 27, 2026 04:41
