Now Available: NVIDIA Blackwell Optimized

The fastest way to run AI locally

xCore is the native runtime for NVIDIA Blackwell + Ampere. Peak performance on every device, from edge to rack. Low-latency inference with unified-memory optimization.

xanuedge-cli --stream

$ xcore serve --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4

✓ Loaded 50 CuTile kernels (3.1 MB)

✓ NVFP4 weights loaded in xxs (xx GB)

✓ Listening on http://localhost:8080 — xx tok/s decode

✔ Model ready. Metrics:

TOKENS/SEC

xxx

LATENCY (TTFT)

xx.x ms

UTILIZATION

xx.x%

Native GPU Kernels

50+ hand-tuned CuTile kernels compiled directly to GPU machine code. Zero cuBLAS*, zero Python, zero overhead between your model and the silicon.

* cuBLAS-LT bridges NVFP4 until CuTile adds native FP4 support

Run Any Model, Any Precision

BF16, FP8, and NVFP4 weight formats with automatic detection. Every kernel hand-written for the critical path — no generic fallbacks.

Deploy Anywhere

One runtime from a 100W edge device to a pro workstation. OpenAI-compatible API, continuous batching, and streaming — same binary, any Blackwell GPU.
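Because the server speaks the standard OpenAI chat-completions protocol, any OpenAI-style client can talk to it. A minimal sketch, assuming the endpoint and model name shown in the demo above (`http://localhost:8080`, `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4`) — adjust both for your deployment:

```python
import json

# Assumed from the demo above; change for your own setup.
BASE_URL = "http://localhost:8080"
MODEL = "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4"

def chat_request(prompt: str, stream: bool = True) -> dict:
    """Build a standard OpenAI-style chat-completions request body."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,  # server-sent token streaming, per the OpenAI spec
    }

if __name__ == "__main__":
    # POST this payload to f"{BASE_URL}/v1/chat/completions"
    # with any HTTP or OpenAI-compatible client.
    print(json.dumps(chat_request("Say hello in five words."), indent=2))
```

With a server running, the same payload works from `curl` or the official OpenAI SDKs by pointing the base URL at `localhost:8080`.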

Ready for the Next Generation?

Join the early access program and experience the speed of Xanuedge in your own environment.