On-Premise LLM Runtime

70B+ parameter models on-premise
Zero cloud dependency
Private inference at scale

A local inference runtime distributed across multiple nodes, running 50+ models: large language models, vision models, embedding models, NLP pipelines, and classification models. Inference has no cloud dependency. Models are loaded and available across the mesh, with a unified gateway routing each request to a node that has the requested model in memory.

Hardware-optimised configuration per node: flash attention enabled, quantised KV cache (q8_0) for a reduced memory footprint, and persistent model loading with an indefinite keep-alive, so models stay warm in VRAM and respond without cold-start latency. Parallelism is tuned per node based on available memory: nodes with large unified memory pools handle four concurrent requests, while dedicated GPU nodes run single-request serial execution to avoid VRAM pressure.
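In Ollama, these settings map onto server environment variables. The variable names below are Ollama's real tuning knobs; the values are a per-node sketch, not the exact deployed configuration:

```shell
# Ollama server tuning, set in the node's service environment (values illustrative).
export OLLAMA_FLASH_ATTENTION=1    # enable flash attention
export OLLAMA_KV_CACHE_TYPE=q8_0   # quantise the KV cache to 8-bit
export OLLAMA_KEEP_ALIVE=-1        # keep models resident in VRAM indefinitely
export OLLAMA_NUM_PARALLEL=4       # 4 on unified-memory nodes; 1 on dedicated GPU nodes
```

Setting `OLLAMA_NUM_PARALLEL` per node is what lets the same runtime serve both the large unified-memory machines and the VRAM-constrained GPU boxes.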

Nodes are managed via system-level service managers (launchd on macOS, systemd on Linux) with health monitoring, automatic restart on failure, and remote management. Model configuration tracks which models are loaded where, memory allocation per model, and hardware capability matching — vision models route to nodes with sufficient VRAM, embedding models to nodes optimised for throughput.
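On a Linux node this can look like a systemd unit along the following lines. This is a minimal sketch: the binary path and environment values are assumptions, not the deployed unit file.

```ini
# /etc/systemd/system/ollama.service (illustrative sketch)
[Unit]
Description=Ollama inference runtime
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
Environment="OLLAMA_KEEP_ALIVE=-1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
Restart=on-failure
RestartSec=3

[Install]
WantedBy=multi-user.target
```

`Restart=on-failure` gives the automatic-restart behaviour described above; launchd's `KeepAlive` key plays the equivalent role on macOS nodes.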

The runtime layer sits beneath every AI capability in the platform. Image analysis, content generation, conversation intelligence, search re-ranking, embedding generation, and vision captioning all run against this infrastructure. The gateway abstraction means no service needs to know which physical machine will handle its inference request — it submits to the gateway, the gateway routes based on model availability and node health.
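The routing decision itself reduces to a small lookup over the model registry and current node health. A minimal sketch, assuming a `registry` mapping models to the nodes holding them and a `healthy` set maintained by health checks (both names are illustrative):

```python
"""Sketch: gateway node selection by model availability and health."""


def route(model, registry, healthy):
    """Return the first healthy node with the model in memory, else None."""
    for node in registry.get(model, []):
        if node in healthy:
            return node
    return None  # caller can queue the request or return an error
```

Because services only ever talk to the gateway, nodes can be drained, restarted, or added to the mesh without any client-side changes.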

// Tech stack

Ollama · gemma3:27b · nomic-embed-text-v2-moe · Flash Attention · Quantised KV Cache · launchd · systemd · Tailscale
Live in production