APPLE SILICON · PRIVACY-FIRST · ON-DEVICE

Edge LLM inference,
measured on real hardware.

Real benchmark data from Apple M2 — tokens/sec, energy efficiency, memory bandwidth. Every byte stays on-device. No cloud. No logs. No surveillance.

framework: mlx-lm 0.12.1 · device: MacBook Air M2 · run: 2026-03-16

peak speed · 39.5 tokens / second

first token · 280 ms · time to first token

efficiency · 2.1 tokens / joule

memory used · 2.5 GB of 8 GB unified memory

models tested · 2 open-weight models

data leaves device · 0 · privacy guaranteed

Model benchmarks

MacBook Air M2 · macOS 15.5

mlx-lm 0.12.1
MODEL | QUANT | TOK/S | LATENCY (ms) | RAM (GB) | TOK/J | MMLU | SCORE
Llama 3.2 1B · 1.2B params · Ultra-fast edge inference | Q4 | 89.3 | 140 | 1 | 3.6 | 49.3 | 88
Llama 3.2 3B · 3.2B params · On-device assistant | Q4 | 39.5 | 280 | 2.5 | 2.1 | 63.4 | 72

1 warmup pass, then 5 timed inference passes on a fixed prompt. Models downloaded from mlx-community on Hugging Face. Measured on live hardware with no other GPU workloads running.
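For reference, the measurement loop looks roughly like the following, a minimal sketch using the mlx-lm Python API. The repo id, prompt text, and token-counting details here are assumptions for illustration, not the published harness.

```python
# Minimal sketch of the benchmark loop: 1 warmup pass, then 5 timed passes.
import time
from mlx_lm import load, generate

MODEL = "mlx-community/Llama-3.2-3B-Instruct-4bit"  # assumed mlx-community repo id
PROMPT = "Explain unified memory in one paragraph."  # stand-in for the fixed prompt
MAX_TOKENS = 256

model, tokenizer = load(MODEL)

# Warmup pass absorbs one-time compile and cache costs, so it is not timed.
generate(model, tokenizer, prompt=PROMPT, max_tokens=MAX_TOKENS)

rates = []
for _ in range(5):  # 5 timed inference passes
    start = time.perf_counter()
    text = generate(model, tokenizer, prompt=PROMPT, max_tokens=MAX_TOKENS)
    elapsed = time.perf_counter() - start
    rates.append(len(tokenizer.encode(text)) / elapsed)

print(f"mean tokens/sec: {sum(rates) / len(rates):.1f}")
```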

Performance charts

[Charts: tokens / second on M2 · memory bandwidth over time · efficiency (tok / joule) by chip]

Zero data egress

Every inference runs entirely on-chip. Prompts, completions, and intermediate activations never touch a network.

Unified memory advantage

Apple's shared CPU/GPU memory pool eliminates costly data transfers, cutting latency and energy use versus discrete GPU setups.
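MLX exposes this directly: an array lives in unified memory, and any op can target the CPU or the GPU without an explicit copy. A small illustration, separate from the benchmark harness:

```python
import mlx.core as mx

a = mx.random.normal((4096, 4096))
b = mx.random.normal((4096, 4096))

# The same arrays feed both devices; the stream argument picks where each op runs.
c_gpu = mx.matmul(a, b, stream=mx.gpu)
c_cpu = mx.matmul(a, b, stream=mx.cpu)
mx.eval(c_gpu, c_cpu)  # MLX is lazy, so force both computations here
```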

Neural Engine acceleration

The M2's Neural Engine handles quantized model operations at up to 15.8 TOPS (18 TOPS on M3), offloading the CPU and reducing thermal throttling.

Live inference demo

cloud baseline via groq api · same models as the benchmark · latency comparison shown


CLOUD VS LOCAL

This demo uses Groq's cloud API as a reference point. The benchmark table above shows the same models running locally on M2, typically with ~30% lower latency and zero data leaving the device.
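A rough sketch of how the cloud side of that comparison can be timed, using Groq's Python SDK with streaming so the first content chunk approximates time to first token. The model id is an assumption; check Groq's current model list.

```python
import os
import time
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

start = time.perf_counter()
stream = client.chat.completions.create(
    model="llama-3.2-3b-preview",  # assumed Groq model id for Llama 3.2 3B
    messages=[{"role": "user", "content": "Explain unified memory in one paragraph."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        # First content chunk ~ time to first token, network round trip included.
        print(f"cloud TTFT: {(time.perf_counter() - start) * 1000:.0f} ms")
        break
```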

METHODOLOGY

Inference: mlx-lm 0.12.1
Energy: macOS powermetrics
Quality: MMLU 5-shot
Runs: 5 timed passes
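For the energy column, tokens/joule can be derived by sampling powermetrics while the timed passes run. A hedged sketch follows: powermetrics needs sudo, and the exact "Combined Power" output line assumed below can vary across macOS releases.

```python
import re
import subprocess

SAMPLES, INTERVAL_MS = 5, 1000

# Sample package power while inference runs in another process.
out = subprocess.run(
    ["sudo", "powermetrics", "--samplers", "cpu_power",
     "-i", str(INTERVAL_MS), "-n", str(SAMPLES)],
    capture_output=True, text=True, check=True,
).stdout

# Assumed output line: "Combined Power (CPU + GPU + ANE): 902 mW"
mw = [float(x) for x in re.findall(r"Combined Power \(CPU \+ GPU \+ ANE\): (\d+) mW", out)]
watts = sum(mw) / len(mw) / 1000
joules = watts * (SAMPLES * INTERVAL_MS / 1000)  # mean W x sampled seconds

tokens_generated = 1280  # hypothetical token count from the timed passes
print(f"tokens/joule: {tokens_generated / joules:.2f}")
```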