Apple Foundation Models (AFM) Metrics: Technical Deep-Dive
1. Overview & Research Landscape
Apple Foundation Models (AFM) come in two primary tiers: AFM-on-device (~3B parameters) and AFM-server. These models power Apple Intelligence across the ecosystem. Apple introduced them in June 2024, published a technical report the following month (arXiv:2407.21075), and released a significantly updated report in mid-2025 (arXiv:2507.13575).
- AFM-on-device: A ~3B parameter dense transformer model optimized for local execution on the Apple Neural Engine (ANE).
- AFM-server: A larger model designed for Private Cloud Compute (PCC), rivaling “frontier” models like GPT-4 in specific task categories.
2. Core Technical Metrics
The following metrics reflect the performance of the 16-bit base models before compression, as reported in the 2025 Apple Intelligence Foundation Language Models Tech Report.
| Benchmark | AFM-on-device (~3B) | AFM-server | Competitor Baseline (Llama-3-8B) |
|---|---|---|---|
| MMLU (5-shot) | 67.8 | 80.0 | 66.2 |
| GSM8K (8-shot CoT) | 70.4 | 72.4 | 79.6 |
| HumanEval (pass@1) | 16.48 | 30.84 | 33.5 |
| IFEval (Instruction-level) | 85.1 | 89.1 | 78.4 |
Analysis of Scores
- Instruction Following (IFEval): Both AFM tiers score above Llama-3-8B (85.1 and 89.1 vs. 78.4). For AFM-on-device this means beating a model with more than twice its parameters, reflecting Apple’s focus on task-oriented fine-tuning.
- Mathematical Reasoning (GSM8K): AFM-on-device (70.4) outperforms the larger Gemma-7B (46.4) and Mistral-7B (52.2), though it trails Llama-3-8B (79.6).
- Coding (HumanEval): AFM’s general language models show modest coding scores; however, Apple utilizes specialized AFM derivatives for Xcode-specific features.
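The comparisons above can be checked with a few lines of arithmetic. The sketch below copies the scores from the table in Section 2 and prints each AFM tier's gap to the Llama-3-8B baseline in points; the names `SCORES` and `gaps_vs_baseline` are illustrative only.

```python
# Scores copied from the Section 2 table:
# (AFM-on-device, AFM-server, Llama-3-8B baseline).
SCORES = {
    "MMLU (5-shot)": (67.8, 80.0, 66.2),
    "GSM8K (8-shot CoT)": (70.4, 72.4, 79.6),
    "HumanEval (pass@1)": (16.48, 30.84, 33.5),
    "IFEval (Instruction-level)": (85.1, 89.1, 78.4),
}


def gaps_vs_baseline(scores):
    """Return {benchmark: (on_device - baseline, server - baseline)} in points."""
    return {
        name: (round(on_device - base, 2), round(server - base, 2))
        for name, (on_device, server, base) in scores.items()
    }


if __name__ == "__main__":
    for name, (d_dev, d_srv) in gaps_vs_baseline(SCORES).items():
        print(f"{name:28s} on-device {d_dev:+7.2f}   server {d_srv:+7.2f}")
```

Positive values mean the AFM tier is ahead of the baseline. On these numbers both tiers lead on MMLU and IFEval and trail on GSM8K and HumanEval.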
3. Quantization & Efficiency: Low-Bit Palettization
Apple utilizes a proprietary Low-Bit Palettization technique to fit the 3B model into memory-constrained devices (e.g., iPhone 15 Pro with 8GB RAM).
- Mechanism: K-means clustering is used to group weights into a Lookup Table (LUT) of centroids.
- Mixed-Bit Strategy: Apple employs a variable bitrate approach, averaging 3.5 to 3.7 bits-per-weight (bpw). This involves a mix of 2-bit and 4-bit layers.
- Accuracy Recovery: To offset the perplexity spike from 2-bit quantization, Apple uses:
  - Quantization-Aware Training (QAT): Training the model with simulated rounding errors.
  - LoRA Adapters: 16-bit high-precision adapters (~tens of MBs) are used to “patch” the quantized base model at runtime.
- Impact: A 2-bit optimized AFM-on-device maintains an MMLU of 64.4 and an IFEval of 82.3, roughly 3.4 and 2.8 points below the 16-bit scores in Section 2 (67.8 and 85.1).
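The palettization mechanism and the mixed-bit average can be illustrated with a minimal NumPy sketch. This is not Apple's implementation: it runs plain 1-D k-means over a single weight tensor, ignores the storage cost of the lookup table itself, and assumes the mix is exactly 2-bit and 4-bit weights. The function names `palettize`, `dequantize`, and `average_bits` are hypothetical.

```python
import numpy as np


def palettize(weights, n_bits, n_iters=25):
    """Cluster weights into 2**n_bits centroids with plain 1-D k-means.

    Returns (lut, indices): lut holds the centroids, indices holds one
    small integer per weight that points into lut.
    """
    flat = np.asarray(weights, dtype=np.float32).ravel()
    k = 2 ** n_bits
    # Initialise centroids at evenly spaced quantiles of the weights.
    lut = np.quantile(flat, np.linspace(0.0, 1.0, k)).astype(np.float32)
    for _ in range(n_iters):
        # Assignment step: nearest centroid for every weight.
        idx = np.abs(flat[:, None] - lut[None, :]).argmin(axis=1)
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = flat[idx == c]
            if members.size:
                lut[c] = members.mean()
    # Final assignment against the updated centroids.
    idx = np.abs(flat[:, None] - lut[None, :]).argmin(axis=1)
    return lut, idx.astype(np.uint8).reshape(np.shape(weights))


def dequantize(lut, indices):
    """Rebuild an approximate weight tensor by looking up each index."""
    return lut[indices]


def average_bits(frac_4bit):
    """Average bits-per-weight if frac_4bit of weights use 4 bits, the rest 2.

    Assumption: LUT storage overhead is ignored.
    """
    return 4.0 * frac_4bit + 2.0 * (1.0 - frac_4bit)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(0.0, 0.02, size=(64, 64)).astype(np.float32)
    for bits in (2, 4):
        lut, idx = palettize(w, bits)
        mse = float(np.mean((w - dequantize(lut, idx)) ** 2))
        print(f"{bits}-bit palette: {lut.size} centroids, MSE {mse:.3e}")
    print("85% of weights at 4-bit:", average_bits(0.85), "bpw")
```

Under the 2-bit/4-bit assumption, the 3.5 to 3.7 bpw range corresponds to roughly 75% to 85% of weights stored at 4 bits. The higher reconstruction error of the 2-bit palette is what QAT and the LoRA adapters are meant to recover.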
4. Hardware Benchmarks (M-Series & A-Series)
Performance is heavily driven by Memory Bandwidth and the Apple Neural Engine (ANE).
Throughput (Tokens Per Second - TPS)
| Hardware | AFM-on-device (TPS) | Memory Bandwidth |
|---|---|---|
| iPhone 15 Pro (A17 Pro) | ~30 TPS | 51.2 GB/s |
| M1 Max | ~33 TPS | 400 GB/s |
| M4 Max | ~58.7 TPS | 546 GB/s |
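To put the table in context, a back-of-envelope roofline estimate gives the decode rate that memory bandwidth alone would allow. The sketch assumes decoding is purely bandwidth-bound, that every weight is read exactly once per generated token, and that the model is 3e9 parameters at 3.7 bits-per-weight (the upper end of the range in Section 3). It ignores KV-cache traffic, activations, and compute limits, so it is an upper bound rather than a prediction.

```python
def model_bytes(n_params, bits_per_weight):
    """Size of the quantized weights in bytes."""
    return n_params * bits_per_weight / 8.0


def bandwidth_bound_tps(bandwidth_gb_s, n_params=3e9, bits_per_weight=3.7):
    """Upper bound on decode tokens/s if all weights are read once per token."""
    return bandwidth_gb_s * 1e9 / model_bytes(n_params, bits_per_weight)


# (memory bandwidth in GB/s, reported TPS) copied from the table above.
HARDWARE = {
    "iPhone 15 Pro (A17 Pro)": (51.2, 30.0),
    "M1 Max": (400.0, 33.0),
    "M4 Max": (546.0, 58.7),
}

if __name__ == "__main__":
    for name, (bandwidth, reported) in HARDWARE.items():
        bound = bandwidth_bound_tps(bandwidth)
        print(f"{name:24s} bound {bound:6.1f} TPS, reported {reported:5.1f} TPS "
              f"({reported / bound:.0%} of bound)")
```

Under these assumptions the iPhone 15 Pro figure sits close to its bandwidth bound, while the M1 Max and M4 Max figures sit far below theirs. That suggests the Mac results in the table were limited by something other than memory bandwidth.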
Latency (Time To First Token - TTFT)
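- Prompt Latency: ~0.6 ms per prompt token (iPhone 15 Pro), so TTFT grows roughly linearly with prompt length; a 1,000-token prompt takes about 0.6 seconds.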
- M-Series TTFT: Generally sub-second for warm starts; cold starts (loading from SSD into Unified Memory) can take 2–7 seconds depending on chip generation.
- Optimization: KV-cache sharing reduces KV-cache memory usage by 37.5%, improving throughput and reducing TTFT for longer contexts.
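The latency figures above reduce to simple arithmetic, sketched below. Two assumptions are made: TTFT scales linearly with prompt length at the quoted per-token rate, and the 37.5% saving comes from a 5:3 split in which the last three-eighths of the layers reuse an earlier layer's cache instead of storing their own. The 5:3 ratio is not stated in this document and is used here only because it reproduces the 37.5% figure; both function names are hypothetical.

```python
def ttft_seconds(prompt_tokens, ms_per_prompt_token=0.6):
    """Estimated time to first token, assuming linear scaling with prompt length."""
    return prompt_tokens * ms_per_prompt_token / 1000.0


def kv_cache_saving(own_cache_layers, shared_cache_layers):
    """Fraction of KV-cache memory saved when shared layers store no cache.

    Assumption: every layer's cache would otherwise be the same size.
    """
    return shared_cache_layers / (own_cache_layers + shared_cache_layers)


if __name__ == "__main__":
    for n in (100, 1000, 4000):
        print(f"{n:5d} prompt tokens -> ~{ttft_seconds(n):.2f} s TTFT")
    print(f"5:3 split saves {kv_cache_saving(5, 3):.1%} of KV-cache memory")
```

This estimate covers warm starts only; the cold-start load time mentioned above would be added on top.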
5. Contradiction Detection: Human Preference vs. Raw Logic
There is a documented “Helpfulness Gap” in Apple’s modeling strategy.
- The Contradiction: AFM-on-device (~3B) has fewer than half the parameters of Mistral-7B or Llama-3-8B and trails Llama-3-8B on GSM8K and HumanEval (Section 2), yet it consistently matches or exceeds them in Human Preference ratings.
- Reasoning: Apple prioritizes Instruction Following (IFEval) and Safety through heavy Reinforcement Learning from Human Feedback (RLHF).
- Observation: AFM over-performs in “Human-likeness” and task completion (summarization, email drafting) compared to its performance on “Raw Logic” benchmarks like HumanEval. This suggests the model is highly specialized for “Digital Assistant” personas rather than general-purpose reasoning.
Gardener’s Summary
For developers focused on ML development, AFM’s primary value lies in its latency-to-utility ratio. It is not a “frontier” reasoning model like GPT-4o or Claude 3.5 Sonnet, but its ability to run at ~59 TPS on an M4 Max (Section 4) with high instruction-following accuracy makes it a strong fit for Local Agents and background system tasks.
Sources
- Apple Intelligence Foundation Language Models Tech Report 2025 (arXiv:2507.13575)
- Apple Intelligence Foundation Language Models (arXiv:2407.21075)
- Apple Machine Learning Research: Introducing Apple’s On-Device and Server Foundation Models (June 2024)
- Apple Developer Documentation: FoundationModels
- Internal stress tests (M1 Max vs M4 Max, 2026) via [[apfel_deep_dive_raw]]