Edge LLM inference,
measured on real hardware.
Real benchmark data from Apple M2 — tokens/sec, energy efficiency, memory bandwidth. Every byte stays on-device. No cloud. No logs. No surveillance.
peak speed
39.5
tokens / second
first token
280ms
time to first token
efficiency
2.1
tokens / joule
memory used
2.5 GB
of 8 GB unified memory
models tested
2
open-weight models
data leaves device
0
privacy guaranteed
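The efficiency figure follows directly from throughput and power draw, since 1 watt is 1 joule per second. A quick sanity check on the headline numbers (the ~18.8 W package power is implied by the two stats, not independently measured):

```python
# Efficiency (tokens/joule) = throughput (tokens/sec) / power (watts),
# because watts are joules per second.
def tokens_per_joule(tokens_per_sec: float, watts: float) -> float:
    return tokens_per_sec / watts

# 39.5 tok/s at 2.1 tok/J implies roughly 18.8 W of package power.
implied_watts = 39.5 / 2.1
print(f"implied power draw: {implied_watts:.1f} W")
print(f"{tokens_per_joule(39.5, implied_watts):.1f} tok/J")
```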
Model benchmarks
MacBook Air M2 · macOS 15.5 · click column headers to sort
One warmup pass, then five timed inference passes on a fixed prompt. Models downloaded from mlx-community on Hugging Face. Measured on live hardware with no other GPU workloads running.
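The warmup-then-timed-passes procedure can be sketched as a small harness. This is an illustrative sketch, not the benchmark's actual code: `generate` here is a placeholder for the model call (for an mlx-community model it would wrap `mlx_lm.generate` and return the number of tokens produced):

```python
import time
from statistics import mean

def benchmark(generate, prompt: str, warmup: int = 1, passes: int = 5) -> float:
    """Return mean tokens/sec over `passes` timed runs of `generate(prompt)`.

    `generate` is any callable taking a prompt and returning the number
    of tokens it produced; the warmup pass warms caches and compilation
    so the timed passes reflect steady-state throughput.
    """
    for _ in range(warmup):
        generate(prompt)                      # untimed warmup pass
    rates = []
    for _ in range(passes):
        start = time.perf_counter()
        n_tokens = generate(prompt)           # timed inference pass
        rates.append(n_tokens / (time.perf_counter() - start))
    return mean(rates)
```

Reporting the mean of per-pass rates (rather than total tokens over total time) keeps a single slow pass from being hidden by the others.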
Performance charts
TOKENS / SECOND · M2
MEMORY BANDWIDTH OVER TIME
EFFICIENCY · TOK / JOULE BY CHIP
Zero data egress
Every inference runs entirely on-chip. Prompts, completions, and intermediate activations never touch a network.
Unified memory advantage
Apple's shared CPU/GPU memory pool eliminates costly data transfers, cutting latency and energy use versus discrete GPU setups.
Neural Engine acceleration
The M3 ANE handles quantized model operations at up to 18 TOPS, offloading the CPU and reducing thermal throttling.
Live inference demo
cloud baseline via the Groq API · same models as the benchmark · latency comparison shown
// try a prompt
CLOUD VS LOCAL
METHODOLOGY