ONNX Model Import

ggmlR includes a built-in zero-dependency ONNX loader (hand-written protobuf parser in C). Load any compatible ONNX model and run inference on CPU or Vulkan GPU — no Python, no TensorFlow, no ONNX Runtime required.

library(ggmlR)

1. Load and inspect a model

model <- ggml_onnx_load("path/to/model.onnx")

# Input / output info
cat("Inputs:\n");  print(ggml_onnx_inputs(model))
cat("Outputs:\n"); print(ggml_onnx_outputs(model))

ggml_onnx_inputs() returns a list with name, shape, and dtype for each input tensor.


2. Run inference

Inputs are named R arrays in NCHW order (matching the ONNX model’s expected layout).

# Random image batch — replace with real data. The list name ("input_name" here)
# must match the model's actual input name as reported by ggml_onnx_inputs().
input <- array(runif(1 * 3 * 224 * 224), dim = c(1L, 3L, 224L, 224L))

result <- ggml_onnx_run(model, list(input_name = input))

cat("Output shape:", paste(dim(result[[1]]), collapse = " x "), "\n")
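R images are often held as H x W x C arrays, so getting to the 1 x C x H x W layout above takes an axis permutation. A minimal base-R sketch (the `img` array is hypothetical, standing in for real decoded image data):

```r
# Hypothetical 224x224 RGB image in H x W x C order
img <- array(runif(224 * 224 * 3), dim = c(224L, 224L, 3L))

# Permute axes to C x H x W, then prepend a batch dimension -> 1 x C x H x W
nchw <- aperm(img, c(3L, 1L, 2L))
input <- array(nchw, dim = c(1L, dim(nchw)))

stopifnot(identical(dim(input), c(1L, 3L, 224L, 224L)))
# Spot check: channel 2 of pixel (10, 20) survived the permutation
stopifnot(input[1, 2, 10, 20] == img[10, 20, 2])
```

`aperm()` reorders the dimensions without touching values, so each `input[1, c, h, w]` equals `img[h, w, c]`.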

For models with multiple inputs, pass a named list:

result <- ggml_onnx_run(model, list(
  input_ids      = array(as.integer(tokens), dim = c(1L, length(tokens))),
  attention_mask = array(1L, dim = c(1L, length(tokens)))
))
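Building those integer arrays from variable-length token sequences requires padding to a common length. A base-R sketch with hypothetical token ids (real ids would come from a tokenizer):

```r
# Hypothetical token id sequences of different lengths
seqs <- list(c(101L, 2023L, 102L), c(101L, 2003L, 1037L, 3231L, 102L))
max_len <- max(lengths(seqs))

# Pad ids with 0L and build the matching mask (1 = real token, 0 = padding)
input_ids <- t(vapply(seqs, function(s)
  c(s, rep(0L, max_len - length(s))), integer(max_len)))
attention_mask <- t(vapply(seqs, function(s)
  c(rep(1L, length(s)), rep(0L, max_len - length(s))), integer(max_len)))

stopifnot(identical(dim(input_ids), c(2L, 5L)))
stopifnot(input_ids[1, 4] == 0L, attention_mask[1, 4] == 0L)
```

Each row is one sequence, so `dim(input_ids)` is batch x max_len, matching the `c(1L, length(tokens))` shape used above for a single sequence.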

3. GPU inference

By default ggmlR tries Vulkan first and falls back to CPU automatically. To force a specific backend:

# Check what's available
if (ggml_vulkan_available()) {
  cat("Vulkan GPU ready\n")
  ggml_vulkan_status()
}

# Load with explicit backend hint
model_gpu <- ggml_onnx_load("path/to/model.onnx", backend = "vulkan")
model_cpu <- ggml_onnx_load("path/to/model.onnx", backend = "cpu")

Weights are transferred to the GPU once at load time. Repeated calls to ggml_onnx_run() do not re-transfer weights.


4. Supported operators

ggmlR supports 50+ ONNX operators. In addition to the standard operator set, it implements custom fused ops: RelPosBias2D (BoTNet).


5. Examples

For full working examples with real ONNX Zoo models see:

# GPU vs CPU benchmark across multiple models
# inst/examples/benchmark_onnx.R

# FP16 inference benchmark
# inst/examples/benchmark_onnx_fp16.R

# Run all supported ONNX Zoo models
# inst/examples/test_all_onnx.R

# BERT sentence similarity
# inst/examples/bert_similarity.R

6. Debugging tips

If a model fails to load or produces wrong results:

  1. Check operator support — print the model’s op list with Python’s onnx package and compare against the operators supported by ggmlR (section 4).

  2. Verify protobuf field numbers — the built-in parser is hand-written; an unexpected field can cause silent mis-parsing. Dump the field numbers present in each tensor:

# Python: dump the protobuf field numbers present in each TensorProto
import onnx, sys
m = onnx.load(sys.argv[1])
for init in m.graph.initializer:
    for field, _ in init.ListFields():
        print(init.name, field.number, field.name)
  3. NaN tracing — use the eval callback for per-node inspection rather than a post-compute scan (which aliases buffers and gives false readings).

  4. Repeated-run aliasing — ggml_backend_sched aliases intermediate buffers over weight buffers. ggmlR calls sched_alloc_and_load() before each compute to reset allocation. If you see correct results on the first run but garbage on subsequent runs, this is the cause.
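The field-tag inspection in step 2 can also be done from R without Python. A minimal sketch of a protobuf varint/tag decoder in base R — a tag byte encodes (field_number << 3) | wire_type, so field number is `tag %/% 8` and wire type is `tag %% 8`:

```r
# Decode one protobuf varint from a raw vector, starting at pos
read_varint <- function(bytes, pos = 1L) {
  value <- 0; shift <- 0
  repeat {
    b <- as.integer(bytes[pos]); pos <- pos + 1L
    value <- value + bitwAnd(b, 0x7FL) * 2^shift  # low 7 bits carry data
    if (b < 0x80L) break                          # high bit clear = last byte
    shift <- shift + 7
  }
  list(value = value, pos = pos)
}

# First byte of a typical TensorProto: 0x08 -> field 1 (dims), wire type 0
tag <- read_varint(as.raw(c(0x08, 0x01)))$value
stopifnot(tag %/% 8 == 1, tag %% 8 == 0)
```

Reading a serialized tensor with `readBin(path, "raw", n = file.size(path))` and walking the tags this way shows exactly which field numbers the hand-written C parser must handle.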

See also the ONNX debugging section in CLAUDE.md for field number tables and the Python dump script.