Describe the feature request
Description:
ONNX Runtime supports int4/uint4 quantization mainly on x86 (via AVX2/AVX-512 kernels), but lacks optimized kernels for IBM Power Systems (PowerPC). This limits efficient inference of 4-bit quantized models on that architecture.
Proposal:
- Add VSX + MMA optimized kernels (e.g., GEMM/MatMul, dot products)
- Extend MLAS (or equivalent) with PowerPC paths
- Support existing packed int4 formats
Notes:
- VSX can handle unpacking; MMA can accelerate matrix multiply/accumulation
- The approach can mirror existing optimized implementations (e.g., the x86 paths in MLAS, or ggml's PowerPC support)
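To make the unpacking step concrete, here is a minimal scalar reference, assuming the common packed layout of two int4 values per byte with the even-indexed element in the low nibble (as in ONNX Runtime's packed int4 tensors). A VSX kernel would process 16 bytes at a time with vector shifts and masks; this only pins down the expected semantics, it is not the proposed kernel.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar reference for unpacking weight-only int4 data stored two values
// per byte (low nibble first). Each 4-bit value is sign-extended to int8,
// giving the range [-8, 7].
std::vector<int8_t> UnpackInt4(const uint8_t* packed, size_t count) {
    std::vector<int8_t> out(count);
    for (size_t i = 0; i < count; ++i) {
        uint8_t byte = packed[i / 2];
        int v = (i % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
        // Sign-extend the 4-bit value.
        out[i] = static_cast<int8_t>(v >= 8 ? v - 16 : v);
    }
    return out;
}
```

A vectorized VSX version would replace the per-element branch with a mask/shift/compare sequence over whole registers, but must produce the same values as this baseline.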
Questions:
- Any plans for non-x86 int4 support?
- Preferred integration point (MLAS vs EP)?
Thanks!
Describe scenario use case
Running LLM inference on IBM Power Systems using 4-bit quantized models (e.g., weight-only int4). Without native int4/uint4 support in ONNX Runtime, deployments must fall back to int8 or higher precision, leading to increased memory bandwidth usage and reduced throughput.
Enabling int4 with VSX/MMA would allow efficient execution of quantized GEMM/dot-product workloads, improving performance and reducing memory footprint for large models on PowerPC-based systems.