APPLE SILICON · PRIVACY-FIRST · ON-DEVICE

Edge LLM inference,
measured on real hardware.

Real benchmark data from Apple M2 — tokens/sec, energy efficiency, memory bandwidth. Every byte stays on-device. No cloud. No logs. No surveillance.

framework: mlx-lm 0.12.1 · device: MacBook Air M2 · run: 2026-03-16

peak speed · 39.5 tokens / second

first token · 280 ms · time to first token

efficiency · 2.1 tokens / joule

memory used · 2.5 GB of 8 GB unified memory

models tested · 2 open-weight models

data leaves device · 0 · privacy guaranteed

Model benchmarks

MacBook Air M2 · macOS 15.5

mlx-lm 0.12.1
MODEL | QUANT | TOK/S | LATENCY (ms) | RAM (GB) | TOK/J | MMLU | SCORE
Llama 3.2 1B · 1.2B params · Ultra-fast edge inference | Q4 | 89.3 | 140 | 1 | 3.6 | 49.3 | 88
Llama 3.2 3B · 3.2B params · On-device assistant | Q4 | 39.5 | 280 | 2.5 | 2.1 | 63.4 | 72

1 warmup pass, then 5 timed inference passes on a fixed prompt. Models downloaded from mlx-community on Hugging Face. Measured on live hardware with no other GPU workloads running.
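For reference, the measurement loop looks roughly like the following, a minimal sketch using the mlx-lm Python API. The repo id, prompt text, and token-counting details here are assumptions for illustration, not the published harness.

```python
# Minimal sketch of the benchmark loop: 1 warmup pass, then 5 timed passes.
import time
from mlx_lm import load, generate

MODEL = "mlx-community/Llama-3.2-3B-Instruct-4bit"  # assumed mlx-community repo id
PROMPT = "Explain unified memory in one paragraph."  # stand-in for the fixed prompt
MAX_TOKENS = 256

model, tokenizer = load(MODEL)

# Warmup pass absorbs one-time compile and cache costs, so it is not timed.
generate(model, tokenizer, prompt=PROMPT, max_tokens=MAX_TOKENS)

rates = []
for _ in range(5):  # 5 timed inference passes
    start = time.perf_counter()
    text = generate(model, tokenizer, prompt=PROMPT, max_tokens=MAX_TOKENS)
    elapsed = time.perf_counter() - start
    rates.append(len(tokenizer.encode(text)) / elapsed)

print(f"mean tokens/sec: {sum(rates) / len(rates):.1f}")
```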

Performance charts

[Charts: tokens / second on M2 · memory bandwidth over time · efficiency (tok / joule) by chip]

Zero data egress

Every inference runs entirely on-chip. Prompts, completions, and intermediate activations never touch a network.

Unified memory advantage

Apple's shared CPU/GPU memory pool eliminates costly data transfers, cutting latency and energy use versus discrete GPU setups.
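MLX exposes this directly: an array lives in unified memory, and any op can target the CPU or the GPU without an explicit copy. A small illustration, separate from the benchmark harness:

```python
import mlx.core as mx

a = mx.random.normal((4096, 4096))
b = mx.random.normal((4096, 4096))

# The same arrays feed both devices; the stream argument picks where each op runs.
c_gpu = mx.matmul(a, b, stream=mx.gpu)
c_cpu = mx.matmul(a, b, stream=mx.cpu)
mx.eval(c_gpu, c_cpu)  # MLX is lazy, so force both computations here
```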

Neural Engine acceleration

The M2's Neural Engine handles quantized model operations at up to 15.8 TOPS (18 TOPS on M3), offloading the CPU and reducing thermal throttling.

Live inference demo

cloud baseline via groq api · same models as the benchmark · latency comparison shown


CLOUD VS LOCAL

This demo uses Groq's cloud API as a reference point. The benchmark table above shows the same models running locally on M2, typically with ~30% lower latency and zero data leaving the device.
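A rough sketch of how the cloud side of that comparison can be timed, using Groq's Python SDK with streaming so the first content chunk approximates time to first token. The model id is an assumption; check Groq's current model list.

```python
import os
import time
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

start = time.perf_counter()
stream = client.chat.completions.create(
    model="llama-3.2-3b-preview",  # assumed Groq model id for Llama 3.2 3B
    messages=[{"role": "user", "content": "Explain unified memory in one paragraph."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        # First content chunk ~ time to first token, network round trip included.
        print(f"cloud TTFT: {(time.perf_counter() - start) * 1000:.0f} ms")
        break
```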

METHODOLOGY

Inference: mlx-lm 0.12.1
Energy: macOS powermetrics
Quality: MMLU 5-shot
Runs: 5 timed passes
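For the energy column, tokens/joule can be derived by sampling powermetrics while the timed passes run. A hedged sketch follows: powermetrics needs sudo, and the exact "Combined Power" output line assumed below can vary across macOS releases.

```python
import re
import subprocess

SAMPLES, INTERVAL_MS = 5, 1000

# Sample package power while inference runs in another process.
out = subprocess.run(
    ["sudo", "powermetrics", "--samplers", "cpu_power",
     "-i", str(INTERVAL_MS), "-n", str(SAMPLES)],
    capture_output=True, text=True, check=True,
).stdout

# Assumed output line: "Combined Power (CPU + GPU + ANE): 902 mW"
mw = [float(x) for x in re.findall(r"Combined Power \(CPU \+ GPU \+ ANE\): (\d+) mW", out)]
watts = sum(mw) / len(mw) / 1000
joules = watts * (SAMPLES * INTERVAL_MS / 1000)  # mean W x sampled seconds

tokens_generated = 1280  # hypothetical token count from the timed passes
print(f"tokens/joule: {tokens_generated / joules:.2f}")
```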