ONNX Model Import

ggmlR includes a built-in zero-dependency ONNX loader (hand-written protobuf parser in C). Load any compatible ONNX model and run inference on CPU or Vulkan GPU — no Python, no TensorFlow, no ONNX Runtime required.

library(ggmlR)

1. Load and inspect a model

model <- ggml_onnx_load("path/to/model.onnx")

# Input / output info
cat("Inputs:\n");  print(ggml_onnx_inputs(model))
cat("Outputs:\n"); print(ggml_onnx_outputs(model))

ggml_onnx_inputs() returns a list with name, shape, and dtype for each input tensor.


2. Run inference

Inputs are named R arrays in NCHW order (matching the ONNX model’s expected layout).

# Random image batch — replace with real data. The list name ("input_name" here)
# must match the model's actual input name as reported by ggml_onnx_inputs().
input <- array(runif(1 * 3 * 224 * 224), dim = c(1L, 3L, 224L, 224L))

result <- ggml_onnx_run(model, list(input_name = input))

cat("Output shape:", paste(dim(result[[1]]), collapse = " x "), "\n")
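R images are often held as H x W x C arrays, so getting to the 1 x C x H x W layout above takes an axis permutation. A minimal base-R sketch (the `img` array is hypothetical, standing in for real decoded image data):

```r
# Hypothetical 224x224 RGB image in H x W x C order
img <- array(runif(224 * 224 * 3), dim = c(224L, 224L, 3L))

# Permute axes to C x H x W, then prepend a batch dimension -> 1 x C x H x W
nchw <- aperm(img, c(3L, 1L, 2L))
input <- array(nchw, dim = c(1L, dim(nchw)))

stopifnot(identical(dim(input), c(1L, 3L, 224L, 224L)))
# Spot check: channel 2 of pixel (10, 20) survived the permutation
stopifnot(input[1, 2, 10, 20] == img[10, 20, 2])
```

`aperm()` reorders the dimensions without touching values, so each `input[1, c, h, w]` equals `img[h, w, c]`.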

For models with multiple inputs, pass a named list:

result <- ggml_onnx_run(model, list(
  input_ids      = array(as.integer(tokens), dim = c(1L, length(tokens))),
  attention_mask = array(1L, dim = c(1L, length(tokens)))
))
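Building those integer arrays from variable-length token sequences requires padding to a common length. A base-R sketch with hypothetical token ids (real ids would come from a tokenizer):

```r
# Hypothetical token id sequences of different lengths
seqs <- list(c(101L, 2023L, 102L), c(101L, 2003L, 1037L, 3231L, 102L))
max_len <- max(lengths(seqs))

# Pad ids with 0L and build the matching mask (1 = real token, 0 = padding)
input_ids <- t(vapply(seqs, function(s)
  c(s, rep(0L, max_len - length(s))), integer(max_len)))
attention_mask <- t(vapply(seqs, function(s)
  c(rep(1L, length(s)), rep(0L, max_len - length(s))), integer(max_len)))

stopifnot(identical(dim(input_ids), c(2L, 5L)))
stopifnot(input_ids[1, 4] == 0L, attention_mask[1, 4] == 0L)
```

Each row is one sequence, so `dim(input_ids)` is batch x max_len, matching the `c(1L, length(tokens))` shape used above for a single sequence.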

3. GPU inference

By default ggmlR tries Vulkan first and falls back to CPU automatically. To force a specific backend:

# Check what's available
if (ggml_vulkan_available()) {
  cat("Vulkan GPU ready\n")
  ggml_vulkan_status()
}

# Load with explicit backend hint
model_gpu <- ggml_onnx_load("path/to/model.onnx", backend = "vulkan")
model_cpu <- ggml_onnx_load("path/to/model.onnx", backend = "cpu")

Weights are transferred to the GPU once at load time. Repeated calls to ggml_onnx_run() do not re-transfer weights.


4. Supported operators

ggmlR supports 50+ ONNX operators. In addition to the standard operator set, it implements custom fused ops: RelPosBias2D (BoTNet).


5. Examples

For full working examples with real ONNX Zoo models see:

# GPU vs CPU benchmark across multiple models
# inst/examples/benchmark_onnx.R

# FP16 inference benchmark
# inst/examples/benchmark_onnx_fp16.R

# Run all supported ONNX Zoo models
# inst/examples/test_all_onnx.R

# BERT sentence similarity
# inst/examples/bert_similarity.R

6. Debugging tips

If a model fails to load or produces wrong results:

  1. Check operator support — print the model’s op list with Python’s onnx package and compare against the operators supported by ggmlR (section 4).

  2. Verify protobuf field numbers — the built-in parser is hand-written; an unexpected field can cause silent mis-parsing. Dump the field numbers present in each tensor:

# Python: dump the protobuf field numbers present in each TensorProto
import onnx, sys
m = onnx.load(sys.argv[1])
for init in m.graph.initializer:
    for field, _ in init.ListFields():
        print(init.name, field.number, field.name)
  3. NaN tracing — use the eval callback for per-node inspection rather than a post-compute scan (which aliases buffers and gives false readings).

  4. Repeated-run aliasing — ggml_backend_sched aliases intermediate buffers over weight buffers. ggmlR calls sched_alloc_and_load() before each compute to reset allocation. If you see correct results on the first run but garbage on subsequent runs, this is the cause.
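The field-tag inspection in step 2 can also be done from R without Python. A minimal sketch of a protobuf varint/tag decoder in base R — a tag byte encodes (field_number << 3) | wire_type, so field number is `tag %/% 8` and wire type is `tag %% 8`:

```r
# Decode one protobuf varint from a raw vector, starting at pos
read_varint <- function(bytes, pos = 1L) {
  value <- 0; shift <- 0
  repeat {
    b <- as.integer(bytes[pos]); pos <- pos + 1L
    value <- value + bitwAnd(b, 0x7FL) * 2^shift  # low 7 bits carry data
    if (b < 0x80L) break                          # high bit clear = last byte
    shift <- shift + 7
  }
  list(value = value, pos = pos)
}

# First byte of a typical TensorProto: 0x08 -> field 1 (dims), wire type 0
tag <- read_varint(as.raw(c(0x08, 0x01)))$value
stopifnot(tag %/% 8 == 1, tag %% 8 == 0)
```

Reading a serialized tensor with `readBin(path, "raw", n = file.size(path))` and walking the tags this way shows exactly which field numbers the hand-written C parser must handle.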

See also the ONNX debugging section in CLAUDE.md for field number tables and the Python dump script.