Apple Foundation Models (AFM) Metrics: Technical Deep-Dive

1. Overview & Research Landscape

Apple Foundation Models (AFM) come in two primary tiers: AFM-on-device (~3B parameters) and AFM-server. These models power Apple Intelligence across the ecosystem. Apple introduced them in June 2024, published a technical report shortly afterwards (arXiv:2407.21075), and released a substantially updated report in mid-2025 (arXiv:2507.13575).

  • AFM-on-device: A ~3B parameter dense transformer model optimized for local execution on the Apple Neural Engine (ANE).
  • AFM-server: A larger model designed for Private Cloud Compute (PCC), rivaling “frontier” models like GPT-4 in specific task categories.

2. Core Technical Metrics

The following metrics reflect the performance of the 16-bit base models before compression, as reported in the 2025 Apple Intelligence Foundation Language Models Tech Report.

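Benchmark                  | AFM-on-device (~3B) | AFM-server | Competitor Baseline (Llama-3-8B)
---------------------------|---------------------|------------|---------------------------------
MMLU (5-shot)              | 67.8                | 80.0       | 66.2
GSM8K (8-shot CoT)         | 70.4                | 72.4       | 79.6
HumanEval (pass@1)         | 16.48               | 30.84      | 33.5
IFEval (Instruction-level) | 85.1                | 89.1       | 78.4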

Analysis of Scores

  • Instruction Following (IFEval): Both AFM models score higher than Llama-3-8B (85.1 and 89.1 vs. 78.4), even though AFM-on-device has less than half its parameters. This reflects Apple’s focus on task-oriented fine-tuning.
  • Mathematical Reasoning (GSM8K): AFM-on-device (70.4) punches above its weight class, outperforming Gemma-7B (46.4) and Mistral-7B (52.2), although it trails Llama-3-8B (79.6).
  • Coding (HumanEval): AFM’s general language models show modest coding scores (16.48 on-device and 30.84 server vs. 33.5 for Llama-3-8B); however, Apple uses specialized AFM derivatives for Xcode-specific features.

3. Quantization & Efficiency: Low-Bit Palettization

Apple uses a proprietary Low-Bit Palettization technique to fit the 3B model into memory-constrained devices (e.g., iPhone 15 Pro with 8 GB of RAM).

  • Mechanism: K-means clustering is used to group weights into a Lookup Table (LUT) of centroids.
  • Mixed-Bit Strategy: Apple assigns different bit widths to different layers, mixing 2-bit and 4-bit layers for an average of 3.5 to 3.7 bits per weight (bpw).
  • Accuracy Recovery: To offset the perplexity spike from 2-bit quantization, Apple uses:
    1. Quantization-Aware Training (QAT): Training the model with simulated rounding errors.
    2. LoRA Adapters: 16-bit high-precision adapters (~tens of MBs) are used to “patch” the quantized base model at runtime.
  • Impact: A 2-bit optimized AFM-on-device maintains an MMLU of 64.4 and an IFEval of 82.3. Against the 16-bit base scores in Section 2 (67.8 and 85.1), that is a drop of 3.4 and 2.8 points respectively.
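
The palettization mechanism and the mixed-bit average described above can be sketched in a few lines. This is a minimal illustration using plain 1-D k-means in NumPy, not Apple's implementation: the function names (palettize, dequantize, average_bpw) are hypothetical, the random matrix stands in for a real layer, and the bits-per-weight helper ignores the storage cost of the lookup tables themselves.

```python
import numpy as np

def palettize(weights, n_bits, n_iters=20):
    """Cluster weights into 2**n_bits centroids with 1-D k-means (Lloyd's algorithm).

    Returns (lut, indices): the lookup table of centroids and, for every
    weight, the index of its nearest centroid.
    """
    flat = weights.ravel().astype(np.float64)
    k = 2 ** n_bits
    # Initialise centroids at evenly spaced quantiles of the weight distribution.
    lut = np.quantile(flat, (np.arange(k) + 0.5) / k)
    for _ in range(n_iters):
        # Assignment step: nearest centroid for every weight.
        idx = np.abs(flat[:, None] - lut[None, :]).argmin(axis=1)
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = flat[idx == c]
            if members.size:
                lut[c] = members.mean()
    idx = np.abs(flat[:, None] - lut[None, :]).argmin(axis=1)
    return lut, idx.reshape(weights.shape).astype(np.uint8)

def dequantize(lut, idx):
    """Rebuild an approximate weight tensor by looking up each index in the LUT."""
    return lut[idx]

def average_bpw(frac_4bit):
    """Average bits per weight when a fraction of weights is 4-bit and the rest 2-bit."""
    return 4.0 * frac_4bit + 2.0 * (1.0 - frac_4bit)

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))           # stand-in for one layer's weights

lut2, idx2 = palettize(w, n_bits=2)     # 4-entry palette
lut4, idx4 = palettize(w, n_bits=4)     # 16-entry palette
err2 = np.mean((w - dequantize(lut2, idx2)) ** 2)
err4 = np.mean((w - dequantize(lut4, idx4)) ** 2)
print(f"2-bit MSE: {err2:.4f}   4-bit MSE: {err4:.4f}")
print(f"75% 4-bit -> {average_bpw(0.75):.2f} bpw, 85% 4-bit -> {average_bpw(0.85):.2f} bpw")
```

Under this simplified model, the quoted 3.5 to 3.7 bpw range corresponds to roughly 75% to 85% of weights sitting in 4-bit layers. The larger reconstruction error of the 2-bit palette is the kind of loss that QAT and the LoRA adapters are meant to recover.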

4. Hardware Benchmarks (M-Series & A-Series)

Token generation is heavily influenced by memory bandwidth and the Apple Neural Engine (ANE), although the figures below show that throughput does not scale linearly with bandwidth.

Throughput (Tokens Per Second - TPS)

Hardware                | AFM-on-device (TPS) | Memory Bandwidth
------------------------|---------------------|-----------------
iPhone 15 Pro (A17 Pro) | ~30 TPS             | 51.2 GB/s
M1 Max                  | ~33 TPS             | 400 GB/s
M4 Max                  | ~58.7 TPS           | 546 GB/s
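
One way to read the throughput table is to compare each measured figure against a simple bandwidth ceiling: if every generated token requires reading all weights once, throughput cannot exceed bandwidth divided by model size. The sketch below does that arithmetic. It assumes exactly 3e9 parameters (the source only says ~3B) and 3.7 bpw (the upper end of the range in Section 3), and it ignores KV-cache traffic, activations, and compute limits, so treat the ceilings as rough estimates rather than measurements.

```python
def weight_bytes(n_params, bits_per_weight):
    """Approximate size of the compressed weights in bytes."""
    return n_params * bits_per_weight / 8

def bandwidth_bound_tps(bandwidth_gb_s, model_bytes):
    """Upper bound on tokens/s if each token reads all weights once from memory."""
    return bandwidth_gb_s * 1e9 / model_bytes

N_PARAMS = 3e9   # assumption: "~3B" taken as exactly 3 billion
BPW = 3.7        # assumption: upper end of the 3.5 to 3.7 bpw range

model_bytes = weight_bytes(N_PARAMS, BPW)

# (measured TPS, memory bandwidth in GB/s) copied from the table above
measured = {
    "iPhone 15 Pro (A17 Pro)": (30.0, 51.2),
    "M1 Max": (33.0, 400.0),
    "M4 Max": (58.7, 546.0),
}

for name, (tps, bandwidth) in measured.items():
    ceiling = bandwidth_bound_tps(bandwidth, model_bytes)
    print(f"{name}: ceiling ~{ceiling:.0f} TPS, measured ~{tps} TPS "
          f"({tps / ceiling:.0%} of ceiling)")
```

Under these assumptions the iPhone 15 Pro figure sits close to its ceiling of about 37 TPS, while both M-series figures fall far below theirs, which suggests that something other than raw memory bandwidth limits throughput on those machines.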

Latency (Time To First Token, TTFT)

  • Prompt Latency: ~0.6 ms per prompt token (iPhone 15 Pro), so TTFT grows roughly in proportion to prompt length.
  • M-Series TTFT: Generally sub-second for warm starts; cold starts (loading from SSD into Unified Memory) can take 2–7 seconds depending on chip generation.
  • Optimization: KV-cache sharing reduces KV-cache memory usage by 37.5%, improving throughput and reducing TTFT for longer contexts.
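
The two latency figures above lend themselves to quick back-of-the-envelope estimates. The sketch below is illustrative only: the helper names are hypothetical, the TTFT estimate assumes the per-token prompt cost stays constant, and the layer count, head count, and head dimension are placeholder values, not AFM's published architecture.

```python
def estimate_ttft_ms(prompt_tokens, ms_per_prompt_token=0.6):
    """Warm-start TTFT estimate, assuming a constant per-prompt-token cost."""
    return prompt_tokens * ms_per_prompt_token

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """KV-cache size: one key and one value vector per layer, head, and position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

def with_kv_sharing(baseline_bytes, reduction=0.375):
    """Apply the 37.5% KV-cache saving quoted above."""
    return baseline_bytes * (1.0 - reduction)

# Prompt-length sweep using the iPhone 15 Pro figure.
for n in (100, 1000, 4000):
    print(f"{n:>5} prompt tokens -> ~{estimate_ttft_ms(n):.0f} ms to first token")

# Hypothetical model shape, chosen only to make the arithmetic concrete.
baseline = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)
shared = with_kv_sharing(baseline)
print(f"KV cache: {baseline / 2**20:.0f} MiB -> {shared / 2**20:.0f} MiB with sharing")
```

With the placeholder shape, a 4096-token context needs 512 MiB of 16-bit KV cache, which drops to 320 MiB after a 37.5% reduction. The absolute numbers depend entirely on the assumed dimensions; only the ratio comes from the source.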

5. Contradiction Detection: Human Preference vs. Raw Logic

There is a documented “Helpfulness Gap” in Apple’s modeling strategy.

  • The Contradiction: AFM-on-device (~3B) has less than half the parameters of Mistral-7B or Llama-3-8B and trails Llama-3-8B on GSM8K and HumanEval in the Section 2 table (though not on MMLU), yet it consistently matches or exceeds these models in Human Preference ratings.
  • Reasoning: Apple prioritizes Instruction Following (IFEval) and Safety through heavy Reinforcement Learning from Human Feedback (RLHF).
  • Observation: AFM performs better on “Human-likeness” and task completion (summarization, email drafting) than its scores on “Raw Logic” benchmarks like HumanEval would suggest. This suggests the model is highly specialized for “Digital Assistant” personas rather than general-purpose reasoning.

Gardener’s Summary

For developers working on ML applications, AFM’s primary value lies in its latency-to-utility ratio. While it is not a “frontier” reasoning model like GPT-4o or Claude 3.5 Sonnet, its ability to run at roughly 59 TPS on an M4 Max with high instruction-following accuracy makes it a strong fit for Local Agents and background system tasks.

Sources

  • Apple Intelligence Foundation Language Models Tech Report 2025 (arXiv:2507.13575)
  • Apple Intelligence Foundation Language Models (arXiv:2407.21075)
  • Apple Machine Learning Research: Introducing Apple’s On-Device and Server Foundation Models (June 2024)
  • Apple Developer Documentation: FoundationModels
  • Internal stress tests (M1 Max vs M4 Max, 2026) via [[apfel_deep_dive_raw]]