ggml/ggml.c and ggml/ggml-opt.cpp. Both now
include the same #ifdef USING_R macro block that
neutralizes printf, fprintf,
fputs, fflush, stderr, and
stdout. These calls were diagnostic-only and were already
silent at runtime via the installed log callback; now the symbols never
reach the compiled object files either.Grammar-constrained generation
(edge_grammar_completion()): Force model output to conform
to a GBNF grammar specification. Ensures valid, parseable structured
output (JSON, enums, numbers, etc.) using llama.cpp’s native grammar
sampler.
JSON schema helper
(edge_json_grammar()): Convert a simple R list schema into
a GBNF grammar string. Supports string, number, integer, boolean fields
and enum (character vector) constraints.
Structured data extraction
(edge_extract()): High-level function that combines prompt
construction with grammar-constrained generation to extract structured
data from text. Returns a parsed R list (requires
jsonlite).
Text classification
(edge_classify()): Classify text into predefined categories
using grammar constraints. Supports single text and batch (vectorized)
classification. Output is guaranteed to be one of the specified
categories.
Text embeddings
(edge_embeddings()): Extract dense vector embeddings from
any loaded model. Returns a numeric matrix (n_texts x n_embd) suitable
for clustering, semantic search, similarity computation, and RAG
pipelines. Supports optional L2 normalization.
Cosine similarity
(edge_similarity(), edge_similarity_matrix()):
Compute pairwise cosine similarity between embedding vectors. Matrix
version efficiently computes all-pairs similarity using normalized
matrix multiply.
Embedding dimension query
(edge_model_n_embd()): Query the embedding dimension of a
loaded model.
Batch processing (edge_map()):
Apply a prompt template over a vector of texts with progress reporting.
Supports both string templates with {text} placeholder and
custom prompt functions. Optional grammar constraint for structured
batch output.
Batch extraction
(edge_extract_batch()): Extract structured data from
multiple texts, returning a data frame with one row per input.
RAG document indexing
(edge_index_documents()): Build a semantic embedding index
from a directory of text files or a character vector. Automatic chunking
with configurable size and overlap.
RAG semantic search
(edge_search()): Find the most relevant text chunks for a
query using cosine similarity over the embedding index.
RAG question answering
(edge_ask()): Retrieval-augmented generation that retrieves
relevant context from an index and generates a grounded answer. Supports
custom system prompts and optional context return for
debugging/transparency.
Plumber API server (edge_serve()):
Serve a model as a local OpenAI-compatible REST API. Endpoints:
/v1/completions, /v1/chat/completions,
/v1/embeddings, /v1/models,
/health. Supports optional API key authentication and CORS.
Requires plumber.
Qwen3 model family in
edge_list_models(): Added Qwen3-0.6B, 1.7B, 4B, and 8B
pre-configured entries from the unsloth GGUF repository.
Friendly names in
edge_download_model(): Now accepts model names
from edge_list_models() (e.g.,
edge_download_model("Qwen3-0.6B")) in addition to
HuggingFace repo IDs. Filename is auto-resolved from the model
registry.
httr download fallback:
.robust_download() now tries httr::GET before
R’s download.file, improving reliability on corporate
networks with custom SSL certificates or proxy configurations.
SIMD optimization warning: On package load,
warns if running without SIMD (generic mode) and suggests reinstalling
from source with EDGEMODELR_SIMD=NATIVE for faster
inference.
Fixed grammar-constrained generation failures
(issue #41): edge_grammar_completion(),
edge_extract(), and edge_extract_batch() were
unusable due to two bugs. First, edge_json_grammar()
emitted rule names like field_1 containing underscores,
which llama.cpp’s grammar parser rejects (only [a-zA-Z0-9-]
is allowed in rule identifiers). Renamed to field-1.
Second, llama_sampler_accept() throws “Unexpected empty
grammar stack” when a token fully satisfies the grammar; the binding now
catches this and terminates cleanly, same as end-of-generation
handling.
Fixed crash from silent context size override
(issue #40 item 11): Removed the auto-reduction of n_ctx
for small models that silently changed the user’s requested context
size. This caused segfaults when prompts exceeded the reduced context.
Context is now used as-is. Minimum n_ctx lowered from 512
to 128 for short-task use cases.
Fixed prompt echo in completion output (issue
#40 item 1): edge_completion() previously returned
prompt + generated_text. Now returns only the generated
text, matching user expectations.
Added prompt length validation: All completion
functions now validate that the tokenized prompt fits within the model’s
context window before calling llama_decode(). Exceeding the
context now raises a clear R error instead of crashing the
process.
Model-native chat templates (issue #40 item 7):
New edge_chat_completion() function reads the model’s chat
template from GGUF metadata (via llama_chat_apply_template)
and formats messages correctly for each model architecture (ChatML,
Llama, Gemma, etc.). build_chat_prompt() updated to accept
an optional ctx parameter for native template formatting,
with ChatML as the generic fallback (replacing the old
Human:/Assistant: format).
edge_classify(ctx, text, c("positive", "negative", "neutral"))edge_extract(ctx, text, list(name = "string", role = "string"))edge_install_cuda() and
edge_install_cuda_toolkit() functions set up GPU inference
automatically.
edge_install_cuda() downloads the matching
ggml-cuda dynamic backend from llama.cpp releases and
extracts the companion ggml-base.dll /
ggml.dll runtime libraries.edge_install_cuda_toolkit() copies
nvcudart_hybrid64.dll from the Windows DriverStore (already
on any NVIDIA-driver machine, no download required) and fetches
cublas64 / cublasLt64 from NVIDIA’s redistrib
server.edge_reload_cuda() activates the CUDA backend in the
current R session without restarting R.edge_cuda_info() reports whether CUDA is installed and
active.n_gpu_layers = -1L to
edge_load_model() for full GPU offload.std::regex to spend 40+ minutes in exponential
backtracking. Added a hand-written fast path
unicode_regex_split_custom_qwen2() in
unicode.cpp, matching the logic of the existing llama-3
fast path. Qwen3-14B now loads in 0.3 s on CPU (3.4 s on GPU including
VRAM transfer). Covers QWEN2 and QWEN3.5 variants.abort() in ggml_abort() with
raise(SIGABRT) under #ifdef USING_R; replaces
abort() token in ggml.cpp with
std::terminate().ggml_print_backtrace() body and
fflush(stdout) / fprintf(stderr, …) in
ggml_abort() with #ifndef USING_R to remove
_Exit, stdout, and stderr symbol
references from ggml.o on macOS.#define _GNU_SOURCE to ggml-cpu.c
(required for SCHED_BATCH, CPU_ZERO,
pthread_setaffinity_np on Linux).CXX_STD = CXX17 replaces -std=c++17 in
PKG_CXXFLAGS in both Makevars and
Makevars.win.-fno-builtin-printf added to GGML_CFLAGS
to suppress printf → puts optimizations.edge_install_cuda,
edge_install_cuda_toolkit, edge_reload_cuda,
edge_cuda_info.Flash attention support: Enabled by default in
edge_load_model() via flash_attn = TRUE.
Reduces memory usage and improves attention computation speed on
CPU.
Full hardware thread utilization: Removed the
4-thread cap for small contexts. edge_load_model() now uses
all available CPU threads by default, with n_threads_batch
set to max for prompt processing.
User-configurable threading: New
n_threads parameter in edge_load_model()
allows explicit control over CPU thread count. Pass NULL
(default) for auto-detect or an integer to limit cores.
Apple Accelerate framework (macOS): Automatically links the Accelerate framework on macOS builds, enabling hardware-accelerated vDSP vector operations for faster matrix math.
Compiler auto-vectorization: Added
-ftree-vectorize to GGML compilation flags on all
platforms, allowing GCC/Clang to generate SIMD instructions for eligible
loops beyond the hand-tuned GGML kernels.
SIMD-optimized build system: Replaced generic
scalar fallback with architecture-aware SIMD detection in both
Makevars (Unix) and Makevars.win (Windows)
User-configurable SIMD levels: Set
EDGEMODELR_SIMD environment variable before install to
select optimization level:
GENERIC: Scalar fallback (maximum compatibility)SSE42: SSE4.2 baseline (default on x86_64)AVX: AVX + F16C (Intel Sandy Bridge 2011+)AVX2: AVX2 + FMA + F16C (Intel Haswell 2013+,
recommended)AVX512: AVX-512 (Intel Skylake-X 2017+)NATIVE: Uses -march=native for maximum
performance on the build machineedge_simd_info(): New function to
query compile-time SIMD status including architecture, compiler
features, and GGML optimization flags
x86 architecture-specific quantization: Enabled
optimized x86 quantization kernels (arch/x86/quants.c,
arch/x86/repack.cpp) with SIMD-accelerated dot products and
matrix operations
Fixed donttest examples: Changed
resource-intensive examples from \donttest{} to
\dontrun{} to prevent downloading multi-GB models during
CRAN checks
Fixed M1 Mac compiler warnings: Added explicit
static_cast<> for:
double to float conversions for
temperature/top_p parameterssize_type to int32_t conversions for
buffer size parametersFixed connection handling: Replaced
on.exit() with tryCatch/finally for proper
connection cleanup in loops (thanks @eddelbuettel)
edge_small_model_config() function provides optimized
settings for small models (1B-3B parameters)
edge_find_ollama_models() - Discover all locally
available Ollama models across platforms (Windows, macOS, Linux)edge_load_ollama_model() - Load Ollama models using
convenient SHA-256 hash prefixes instead of full file pathstest_ollama_model_compatibility() - Built-in
compatibility testing for Ollama modelsstd::filesystem on
macOS builds<mach-o/dyld.h> inclusion with direct function
declarations to avoid enum conflicts-march=native, -mtune=native, etc.)
from Makevars for CRAN compatibilityedge_clean_cache() functionedge_load_model() - Load GGUF model files for
inferenceedge_completion() - Generate text completionsedge_stream_completion() - Stream text generation with
real-time callbacksedge_chat_stream() - Interactive chat session with
streaming responsesedge_free_model() - Memory management and cleanupis_valid_model() - Model context validationedge_list_models() - List pre-configured popular
modelsedge_download_model() - Download models from Hugging
Face Hubedge_quick_setup() - One-line model download and
setupThis release provides a complete, production-ready solution for Local Large Language Model Inference Engine in R, enabling private, offline text generation workflows.