Benchmarks

Real-world inference performance on tested models. Updated as we add support for more architectures.

Test Configuration

GPU: NVIDIA B200
Runtime: xCore v0.x
Prompt Length: 512 tokens
Batch Size: 1

Model                Precision  Params             Decode (tok/s)  TTFT (ms)  VRAM (GB)
Nemotron-Super-120B  NVFP4      120B (12B active)  —               —          —
Llama 3.1 70B        FP8        70B                —               —          —
Llama 3.1 8B         BF16       8B                 —               —          —
Mistral 7B v0.3      FP8        7B                 —               —          —
Qwen2.5 72B          NVFP4      72B                —               —          —
Phi-3 Mini 3.8B      BF16       3.8B               —               —          —

Numbers marked with — are pending publication. More models coming soon.
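For reference, the two latency metrics in the table can be collected with a simple timing harness around any streaming generation API. The sketch below is not xCore-specific: `fake_stream` is a hypothetical stand-in for a real token stream, used only to show the bookkeeping (TTFT is the delay before the first token; the decode rate counts tokens after the first).

```python
import time

def fake_stream(n_tokens, first_delay=0.02, per_token=0.005):
    """Simulated streaming generator; a stand-in for a real inference API."""
    time.sleep(first_delay)  # prefill phase: delay before the first token
    yield "tok0"
    for i in range(1, n_tokens):
        time.sleep(per_token)  # steady-state decode delay per token
        yield f"tok{i}"

def measure(stream):
    """Return (ttft_ms, decode_tok_s) for a token stream.

    TTFT is measured from request start to the first token; the decode
    rate excludes the first token, matching the Decode (tok/s) column.
    """
    start = time.perf_counter()
    first = last = None
    count = 0
    for _ in stream:
        last = time.perf_counter()
        if first is None:
            first = last
        count += 1
    ttft_ms = (first - start) * 1000.0
    decode_tok_s = (count - 1) / (last - first) if count > 1 else 0.0
    return ttft_ms, decode_tok_s
```

In practice you would discard a few warm-up runs and report a median over many iterations, since both metrics are sensitive to scheduling jitter.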