edgemodelr 0.4.1

CRAN Resubmission Fixes

Stderr references in compiled objects (CRAN auto-check NOTE on Debian): the previous CRAN cleanup (commit d8870bd) added stdio suppression to 7 upstream files but missed ggml/ggml.c and ggml/ggml-opt.cpp. Both now include the same #ifdef USING_R macro block that neutralizes printf, fprintf, fputs, fflush, stderr, and stdout. These calls were diagnostic-only and were already silent at runtime via the installed log callback; now the symbols never reach the compiled object files either.

edgemodelr 0.4.0

Structured Output, Embeddings, RAG, and API Server

New Features

Grammar-constrained generation (edge_grammar_completion()): Force model output to conform to a GBNF grammar specification. Ensures valid, parseable structured output (JSON, enums, numbers, etc.) using llama.cpp’s native grammar sampler.
JSON schema helper (edge_json_grammar()): Convert a simple R list schema into a GBNF grammar string. Supports string, number, integer, boolean fields and enum (character vector) constraints.
Structured data extraction (edge_extract()): High-level function that combines prompt construction with grammar-constrained generation to extract structured data from text. Returns a parsed R list (requires jsonlite).
Text classification (edge_classify()): Classify text into predefined categories using grammar constraints. Supports single text and batch (vectorized) classification. Output is guaranteed to be one of the specified categories.
Text embeddings (edge_embeddings()): Extract dense vector embeddings from any loaded model. Returns a numeric matrix (n_texts × n_embd) suitable for clustering, semantic search, similarity computation, and RAG pipelines. Supports optional L2 normalization.
Cosine similarity (edge_similarity(), edge_similarity_matrix()): Compute pairwise cosine similarity between embedding vectors. Matrix version efficiently computes all-pairs similarity using normalized matrix multiply.
Embedding dimension query (edge_model_n_embd()): Query the embedding dimension of a loaded model.
Batch processing (edge_map()): Apply a prompt template over a vector of texts with progress reporting. Supports both string templates with {text} placeholder and custom prompt functions. Optional grammar constraint for structured batch output.
Batch extraction (edge_extract_batch()): Extract structured data from multiple texts, returning a data frame with one row per input.
RAG document indexing (edge_index_documents()): Build a semantic embedding index from a directory of text files or a character vector. Automatic chunking with configurable size and overlap.
RAG semantic search (edge_search()): Find the most relevant text chunks for a query using cosine similarity over the embedding index.
RAG question answering (edge_ask()): Retrieval-augmented generation that retrieves relevant context from an index and generates a grounded answer. Supports custom system prompts and optional context return for debugging/transparency.
Plumber API server (edge_serve()): Serve a model as a local OpenAI-compatible REST API. Endpoints: /v1/completions, /v1/chat/completions, /v1/embeddings, /v1/models, /health. Supports optional API key authentication and CORS. Requires plumber.
Qwen3 model family in edge_list_models(): Added Qwen3-0.6B, 1.7B, 4B, and 8B pre-configured entries from the unsloth GGUF repository.
Friendly names in edge_download_model(): Now accepts model names from edge_list_models() (e.g., edge_download_model("Qwen3-0.6B")) in addition to HuggingFace repo IDs. Filename is auto-resolved from the model registry.
httr download fallback: .robust_download() now tries httr::GET before R’s download.file, improving reliability on corporate networks with custom SSL certificates or proxy configurations.
SIMD optimization warning: On package load, warns if running without SIMD (generic mode) and suggests reinstalling from source with EDGEMODELR_SIMD=NATIVE for faster inference.
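
The chunking with size and overlap mentioned for edge_index_documents() can be sketched in base R. This is a minimal illustration, not the package's implementation: chunk_text() is a hypothetical helper, and it assumes that size and overlap are measured in characters, which the notes above do not specify.

```r
# Hypothetical helper (not part of edgemodelr): split one string into
# fixed-size character chunks; consecutive chunks share `overlap` characters.
chunk_text <- function(text, size = 200L, overlap = 50L) {
  stopifnot(is.character(text), length(text) == 1L,
            overlap >= 0L, size > overlap)
  n <- nchar(text)
  if (n == 0L) return(character(0))
  step <- size - overlap
  # Start positions advance by `step`; the last chunk always reaches the end.
  starts <- seq.int(1L, max(1L, n - overlap), by = step)
  substring(text, starts, pmin(starts + size - 1L, n))
}

doc <- paste(rep("abcdefghij", 5), collapse = "")  # 50 characters
chunks <- chunk_text(doc, size = 20L, overlap = 5L)
print(chunks)
```

The overlap keeps a sentence that straddles a chunk boundary visible in both neighbouring chunks, at the cost of embedding some text twice.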

Bug Fixes

Fixed grammar-constrained generation failures (issue #41): edge_grammar_completion(), edge_extract(), and edge_extract_batch() were unusable due to two bugs. First, edge_json_grammar() emitted rule names like field_1 containing underscores, which llama.cpp’s grammar parser rejects (only [a-zA-Z0-9-] is allowed in rule identifiers). Renamed to field-1. Second, llama_sampler_accept() throws “Unexpected empty grammar stack” when a token fully satisfies the grammar; the binding now catches this and terminates cleanly, same as end-of-generation handling.
Fixed crash from silent context size override (issue #40 item 11): Removed the auto-reduction of n_ctx for small models that silently changed the user’s requested context size. This caused segfaults when prompts exceeded the reduced context. Context is now used as-is. Minimum n_ctx lowered from 512 to 128 for short-task use cases.
Fixed prompt echo in completion output (issue #40 item 1): edge_completion() previously returned prompt + generated_text. Now returns only the generated text, matching user expectations.
Added prompt length validation: All completion functions now validate that the tokenized prompt fits within the model’s context window before calling llama_decode(). Exceeding the context now raises a clear R error instead of crashing the process.
Model-native chat templates (issue #40 item 7): New edge_chat_completion() function reads the model’s chat template from GGUF metadata (via llama_chat_apply_template) and formats messages correctly for each model architecture (ChatML, Llama, Gemma, etc.). build_chat_prompt() updated to accept an optional ctx parameter for native template formatting, with ChatML as the generic fallback (replacing the old Human:/Assistant: format).
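
The rule-name restriction behind the first grammar fix can be illustrated in base R. Both helpers below are hypothetical and are not the code used by edge_json_grammar(); they only show the idea of keeping rule identifiers within [a-zA-Z0-9-] and of building a one-rule GBNF grammar for a fixed set of labels. The sketch assumes labels contain no double quotes or backslashes.

```r
# Hypothetical helper: replace every character outside [a-zA-Z0-9-]
# with a hyphen so the name is accepted as a GBNF rule identifier.
gbnf_rule_name <- function(x) gsub("[^a-zA-Z0-9-]", "-", x)

# Hypothetical helper: a grammar whose only valid outputs are the labels.
enum_grammar <- function(labels) {
  stopifnot(is.character(labels), length(labels) >= 1L,
            !grepl('"', labels, fixed = TRUE),
            !grepl("\\", labels, fixed = TRUE))
  alternatives <- paste0('"', labels, '"', collapse = " | ")
  paste0("root ::= ", alternatives)
}

print(gbnf_rule_name("field_1"))
cat(enum_grammar(c("positive", "negative", "neutral")), "\n")
```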

Use Cases Unlocked

Sentiment analysis: edge_classify(ctx, text, c("positive", "negative", "neutral"))
Entity extraction: edge_extract(ctx, text, list(name = "string", role = "string"))
Data labeling: Batch classify thousands of rows with guaranteed valid labels
Semantic search: Embed documents and queries, find nearest neighbors
Document clustering: Compute similarity matrices, feed to hclust/kmeans
RAG foundations: Embed corpus, retrieve relevant context for generation
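
The document clustering use case can be sketched without a model by substituting a small hand-written matrix for real embeddings. The matrix emb and the helper cosine_matrix() are illustrative assumptions; in practice the rows would come from edge_embeddings(), and edge_similarity_matrix() would play the role of the helper.

```r
# Toy stand-in for an embedding matrix (rows = texts, columns = dimensions).
emb <- rbind(a1 = c(1.0, 0.0, 0.0),
             a2 = c(0.9, 0.1, 0.0),
             b1 = c(0.0, 1.0, 0.0),
             b2 = c(0.0, 0.9, 0.1))

# All-pairs cosine similarity via a normalized matrix multiply.
# Assumes no row is all zeros (that would produce NaN).
cosine_matrix <- function(m) {
  unit <- m / sqrt(rowSums(m^2))  # L2-normalize each row
  tcrossprod(unit)                # unit %*% t(unit)
}

sim <- cosine_matrix(emb)

# Turn similarity into a distance and cluster hierarchically.
clusters <- cutree(hclust(as.dist(1 - sim), method = "average"), k = 2)
print(round(sim, 3))
print(clusters)
```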

edgemodelr 0.3.0

CUDA GPU Support and Qwen3 Tokenizer Fix

New Features

CUDA GPU acceleration (Windows): New edge_install_cuda() and edge_install_cuda_toolkit() functions set up GPU inference automatically.
- edge_install_cuda() downloads the matching ggml-cuda dynamic backend from llama.cpp releases and extracts the companion ggml-base.dll / ggml.dll runtime libraries.
- edge_install_cuda_toolkit() copies nvcudart_hybrid64.dll from the Windows DriverStore (already on any NVIDIA-driver machine, no download required) and fetches cublas64 / cublasLt64 from NVIDIA’s redistrib server.
- edge_reload_cuda() activates the CUDA backend in the current R session without restarting R.
- edge_cuda_info() reports whether CUDA is installed and active.
- Pass n_gpu_layers = -1L to edge_load_model() for full GPU offload.
- Tested on NVIDIA RTX 5070 Ti (Blackwell sm_120, CUDA 13.1, 12 GB VRAM): Qwen3-14B loads in 3.4 s with full VRAM offload.
Updated llama.cpp to build b8179 (GGML 0.9.7): Brings all upstream model architecture updates, sampler improvements, and quantization fixes.

Bug Fixes

Qwen3 / QWEN2 tokenizer 40-minute load time (8000× speedup): The QWEN2 byte-level regex pattern caused GCC’s std::regex to spend 40+ minutes in exponential backtracking. Added a hand-written fast path unicode_regex_split_custom_qwen2() in unicode.cpp, matching the logic of the existing llama-3 fast path. Qwen3-14B now loads in 0.3 s on CPU (3.4 s on GPU including VRAM transfer). Covers QWEN2 and QWEN3.5 variants.

CRAN Compliance

Replaced abort() in ggml_abort() with raise(SIGABRT) under #ifdef USING_R; replaced the abort() token in ggml.cpp with std::terminate().
Guarded ggml_print_backtrace() body and fflush(stdout) / fprintf(stderr, …) in ggml_abort() with #ifndef USING_R to remove _Exit, stdout, and stderr symbol references from ggml.o on macOS.
Added #define _GNU_SOURCE to ggml-cpu.c (required for SCHED_BATCH, CPU_ZERO, pthread_setaffinity_np on Linux).
CXX_STD = CXX17 replaces -std=c++17 in PKG_CXXFLAGS in both Makevars and Makevars.win.
-fno-builtin-printf added to GGML_CFLAGS to suppress printf → puts optimizations.
Man pages added for edge_install_cuda, edge_install_cuda_toolkit, edge_reload_cuda, edge_cuda_info.

edgemodelr 0.2.0

SIMD Optimizations for Faster CPU Inference

New Features

Flash attention support: Enabled by default in edge_load_model() via flash_attn = TRUE. Reduces memory usage and improves attention computation speed on CPU.
Full hardware thread utilization: Removed the 4-thread cap for small contexts. edge_load_model() now uses all available CPU threads by default, with n_threads_batch set to the maximum available thread count for prompt processing.
User-configurable threading: New n_threads parameter in edge_load_model() allows explicit control over CPU thread count. Pass NULL (default) for auto-detect or an integer to limit cores.
Apple Accelerate framework (macOS): Automatically links the Accelerate framework on macOS builds, enabling hardware-accelerated vDSP vector operations for faster matrix math.
Compiler auto-vectorization: Added -ftree-vectorize to GGML compilation flags on all platforms, allowing GCC/Clang to generate SIMD instructions for eligible loops beyond the hand-tuned GGML kernels.

Existing Features

SIMD-optimized build system: Replaced generic scalar fallback with architecture-aware SIMD detection in both Makevars (Unix) and Makevars.win (Windows)
- x86_64: Enables SSE4.2 baseline by default (universal since Intel Nehalem 2008)
- aarch64/arm64: NEON support built into the ABI (no extra flags needed)
- Other architectures: Automatic generic fallback
User-configurable SIMD levels: Set EDGEMODELR_SIMD environment variable before install to select optimization level:
- GENERIC: Scalar fallback (maximum compatibility)
- SSE42: SSE4.2 baseline (default on x86_64)
- AVX: AVX + F16C (Intel Sandy Bridge 2011+)
- AVX2: AVX2 + FMA + F16C (Intel Haswell 2013+, recommended)
- AVX512: AVX-512 (Intel Skylake-X 2017+)
- NATIVE: Uses -march=native for maximum performance on the build machine
edge_simd_info(): New function to query compile-time SIMD status including architecture, compiler features, and GGML optimization flags
x86 architecture-specific quantization: Enabled optimized x86 quantization kernels (arch/x86/quants.c, arch/x86/repack.cpp) with SIMD-accelerated dot products and matrix operations
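
Selecting a SIMD level before a source install might look like the sketch below. set_simd_level() is a hypothetical convenience wrapper, not a package function; the only things taken from the notes above are the EDGEMODELR_SIMD variable and its six values. The install call is left commented out so the snippet is safe to run as-is.

```r
# Levels listed in the release notes above.
simd_levels <- c("GENERIC", "SSE42", "AVX", "AVX2", "AVX512", "NATIVE")

# Hypothetical wrapper: validate the level, then export it for the build.
set_simd_level <- function(level) {
  level <- match.arg(toupper(level), simd_levels)
  Sys.setenv(EDGEMODELR_SIMD = level)
  invisible(level)
}

set_simd_level("avx2")
print(Sys.getenv("EDGEMODELR_SIMD"))

# The variable must be set before compiling, so reinstall from source:
# install.packages("edgemodelr", type = "source")
```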

Performance

15-40% faster inference on x86_64 with SSE4.2 baseline vs generic scalar
Up to 2-3x faster with AVX2 for quantized model operations
SSSE3-accelerated integer multiply-accumulate for quantized dot products

edgemodelr 0.1.5

CRAN Policy Fixes

Bug Fixes

Fixed donttest examples: Changed resource-intensive examples from \donttest{} to \dontrun{} to prevent downloading multi-GB models during CRAN checks
Fixed M1 Mac compiler warnings: Added explicit static_cast<> for:
- double to float conversions for temperature/top_p parameters
- size_type to int32_t conversions for buffer size parameters
Fixed connection handling: Replaced on.exit() with tryCatch/finally for proper connection cleanup in loops (thanks @eddelbuettel)
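
The connection-handling change follows a general R pattern: on.exit() only fires when the enclosing function returns, so connections opened inside a loop would stay open until then, whereas tryCatch(..., finally = ) cleans up at the end of each iteration. The function below is an illustrative sketch of that pattern with made-up names, not the package's code.

```r
# Illustrative only: read the first line of each file, closing every
# connection as soon as its iteration finishes, even if reading fails.
read_first_lines <- function(paths) {
  out <- rep(NA_character_, length(paths))
  for (i in seq_along(paths)) {
    con <- file(paths[i], open = "r")
    line <- tryCatch(
      readLines(con, n = 1L, warn = FALSE),
      finally = close(con)  # runs once per iteration
    )
    if (length(line) > 0L) out[i] <- line
  }
  out
}

f1 <- tempfile(fileext = ".txt"); writeLines(c("alpha", "x"), f1)
f2 <- tempfile(fileext = ".txt"); writeLines(c("beta", "y"), f2)

n_before <- nrow(showConnections())
first <- read_first_lines(c(f1, f2))
n_after <- nrow(showConnections())
print(first)
```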

edgemodelr 0.1.4

Performance Optimizations for Small Language Models

New Features

Small Model Configuration Helper: New edge_small_model_config() function provides optimized settings for small models (1B-3B parameters)
- Device-specific presets: mobile, laptop, desktop, and server
- Adaptive configuration based on model size and available RAM
- Built-in performance tips and recommendations
- Automatic parameter tuning for optimal inference speed
Adaptive Batch Processing: Intelligent batch size optimization based on context length
- Small contexts (≤512): Uses up to full context for batching
- Medium contexts (512-2048): Uses 1/2 context for optimal throughput
- Large contexts (2048-4096): Uses 1/4 context to balance speed and memory
- Very large contexts (>4096): Caps at 2048 tokens for stability
Smart Thread Allocation: Context-aware CPU thread management
- Small models automatically limit threads to avoid overhead
- Reduces CPU contention on resource-constrained devices
- Improves inference speed for models with contexts ≤2048 tokens
Automatic Context Optimization: Model size-based context tuning
- Small models (<1GB): Optimized to 1024 tokens for faster inference
- Medium models (1-2GB): Set to 1536 tokens for balanced performance
- Large models (>2GB): Maintains 2048+ tokens for quality
- User override available via n_ctx parameter
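
The adaptive batch rule listed above can be written as a small function. This is a sketch of the 0.1.4 description rather than the actual C++ binding code; the function name is hypothetical, and because the listed ranges share their endpoints, the sketch assumes each boundary value (512, 2048, 4096) belongs to the smaller tier.

```r
# Hypothetical re-statement of the batch-size tiers described for 0.1.4.
adaptive_batch_size <- function(n_ctx) {
  stopifnot(is.numeric(n_ctx), length(n_ctx) == 1L, n_ctx >= 1)
  batch <- if (n_ctx <= 512) {
    n_ctx            # small: up to the full context
  } else if (n_ctx <= 2048) {
    n_ctx %/% 2      # medium: half the context
  } else if (n_ctx <= 4096) {
    n_ctx %/% 4      # large: a quarter of the context
  } else {
    2048             # very large: fixed cap
  }
  as.integer(batch)
}

print(vapply(c(512, 1024, 4096, 8192), adaptive_batch_size, integer(1)))
```

Later releases changed related defaults (the thread cap was removed in 0.2.0 and the automatic context reduction in 0.4.0), so these tiers describe 0.1.4 only.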

Performance Improvements

Faster Small Model Inference: 15-30% speed improvement for small models through optimized batch and thread settings
Reduced Memory Footprint: Better memory efficiency for resource-constrained environments
Lower Latency: Optimized thread allocation reduces context switching overhead
Better Scalability: Adaptive configurations scale from mobile devices to servers

Examples and Documentation

Small Model Optimization Example: Comprehensive example demonstrating all optimization features
- Configuration comparison across device types
- Performance benchmarking workflow
- Best practices for different model sizes
- Manual tuning guidelines
Enhanced Testing: New test suite for small model configuration
- Tests for all device target configurations
- Validation of adaptive parameter adjustments
- Safety checks for edge cases

Technical Details

Improved C++ bindings with adaptive batch size calculations
Enhanced R API with intelligent parameter defaults
Better integration between model size detection and configuration
Comprehensive documentation for optimization features

edgemodelr 0.1.2

Major New Features

Ollama Integration

Native Ollama Support: Complete integration with Ollama models through automatic model discovery and SHA-256 hash-based loading
edge_find_ollama_models() - Discover all locally available Ollama models across platforms (Windows, macOS, Linux)
edge_load_ollama_model() - Load Ollama models using convenient SHA-256 hash prefixes instead of full file paths
test_ollama_model_compatibility() - Built-in compatibility testing for Ollama models
Cross-platform Model Detection: Robust model discovery supporting standard installations, snap packages (Linux), and various Windows configurations
Windows OneDrive Compatibility: Enhanced path detection that properly handles Windows OneDrive document folder redirections

Comprehensive Examples Suite

Structured Learning Path: Complete examples directory with progressive difficulty levels (Beginner → Intermediate → Advanced)
01_basic_usage.R: Fundamental operations including model loading, text generation, parameter tuning, and error handling
02_ollama_integration.R: Complete Ollama workflow with model discovery, hash-based loading, and compatibility testing
03_streaming_generation.R: Real-time streaming text generation with interactive chat interfaces and callback processing
04_performance_optimization.R: Advanced performance tuning including GPU acceleration, benchmarking, memory management, and batch processing
examples/README.md: Comprehensive documentation with learning paths, troubleshooting guide, and customization instructions

Package Structure Improvements

Organized File Structure: Consolidated all examples into structured examples/ directory with consistent formatting
Enhanced Documentation: Improved inline documentation and example comments throughout

edgemodelr 0.1.1

Bug Fixes and Improvements

Compilation Fixes

macOS Boolean Conflicts: Completely resolved Boolean enum conflicts by avoiding problematic system headers and using direct function declarations
Filesystem Compatibility: Added comprehensive fallback implementation for disabled std::filesystem on macOS builds
Header Protection: Implemented robust cross-platform header inclusion strategy that works with R, Rcpp, and system headers
System Header Workarounds: Replaced <mach-o/dyld.h> inclusion with direct function declarations to avoid enum conflicts
Format Attribute Warnings: Suppressed unsupported printf format attribute warnings on macOS Apple Clang compiler
CRAN Compliance: Removed non-portable optimization flags (-march=native, -mtune=native, etc.) from Makevars for CRAN compatibility
Cross-platform Build: Enhanced Makevars configuration for better macOS compatibility with R package requirements

Demo and Documentation Updates

Modern UI: Updated streaming chat demo with modern bslib interface for enhanced user experience
Documentation: Improved documentation for edge_clean_cache() function
Examples: Enhanced streaming chat example with better UI components

Technical Improvements

Build System: Updated Makevars files for improved compilation on Windows and Unix systems
Core Bindings: Enhanced C++ bindings for better performance and stability

edgemodelr 0.1.0

Initial CRAN Release

New Features

Local LLM Inference: Complete R interface for running large language models locally using llama.cpp and GGUF model files
Model Management: Built-in functions for downloading and managing popular models from Hugging Face
Text Generation: Support for both blocking and streaming text completion
Interactive Chat: Real-time streaming chat interface with conversation history
Privacy-First: All processing happens locally without external API calls

Core Functions

edge_load_model() - Load GGUF model files for inference
edge_completion() - Generate text completions
edge_stream_completion() - Stream text generation with real-time callbacks
edge_chat_stream() - Interactive chat session with streaming responses
edge_free_model() - Memory management and cleanup
is_valid_model() - Model context validation

Model Management

edge_list_models() - List pre-configured popular models
edge_download_model() - Download models from Hugging Face Hub
edge_quick_setup() - One-line model download and setup

System Support

Self-contained: Includes complete llama.cpp implementation
Cross-platform: Works on Windows, macOS, and Linux
CPU optimized: Runs efficiently on standard hardware
Memory efficient: Support for quantized models

Documentation

Comprehensive getting started vignette
Complete API documentation with examples
README with extensive usage examples
Test coverage for all major functionality

Technical Implementation

C++17 integration via Rcpp
Real-time token streaming with callback support
Automatic memory management with RAII
Robust error handling and validation
Thread-safe model operations

This release provides a complete, production-ready solution for local large language model inference in R, enabling private, offline text generation workflows.