Model Unlearning & Editing

Overview

Model Unlearning (or Machine Unlearning) is the technical process of removing specific information from a model’s weights after the training phase. Unlike [[Retrieval-Augmented Generation (RAG)]], which adds information to the prompt, Unlearning removes information from the “baked-in” knowledge of the model.

The LoRA Approach (2026 Standard)

In 2026, LoRA (Low-Rank Adaptation) is the primary method for unlearning because it confines updates to a small set of adapter weights, allowing targeted changes without the cost of full retraining.

1. Gradient Ascent (GA)

Instead of minimizing the loss to learn a fact, training performs Gradient Ascent to maximize the loss on specific target data.

  • Goal: To make the model “blind” to specific facts (e.g., PII, copyrighted data, or outdated project info).
  • Tooling: Train a LoRA adapter on a “Forget Set” with the loss negated, so that a standard optimizer's descent step becomes an ascent step on the original loss.
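
The idea can be illustrated with a minimal sketch in plain Python, assuming a toy softmax classifier whose frozen weight matrix `W` is augmented by a rank-1 update `B @ A`. All names here (`forward`, `ascent_step`, `forget_x`, `forget_y`) and the numbers are hypothetical and chosen only for illustration; a real setup would use an LLM and a LoRA library rather than hand-written gradients.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def log_softmax(logits):
    """Numerically stable log-softmax."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def forward(W, A, B, x):
    """Logits of the frozen layer W plus the low-rank update B @ A."""
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    return [b + l for b, l in zip(base, low_rank)]

def loss(W, A, B, x, y):
    """Cross-entropy of the true label y for input x."""
    return -log_softmax(forward(W, A, B, x))[y]

def ascent_step(W, A, B, x, y, lr):
    """One gradient-ASCENT step on the adapter only; W is never modified."""
    logp = log_softmax(forward(W, A, B, x))
    # Gradient of the cross-entropy with respect to the logits: p - onehot(y).
    g = [math.exp(lp) - (1.0 if i == y else 0.0) for i, lp in enumerate(logp)]
    Ax = matvec(A, x)
    Btg = [sum(B[i][k] * g[i] for i in range(len(B))) for k in range(len(A))]
    # Adding the gradient (instead of subtracting it) increases the loss.
    new_B = [[B[i][k] + lr * g[i] * Ax[k] for k in range(len(A))]
             for i in range(len(B))]
    new_A = [[A[k][j] + lr * Btg[k] * x[j] for j in range(len(x))]
             for k in range(len(A))]
    return new_A, new_B

# Frozen base weights: the model is confident that forget_x belongs to class 0.
W = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
# LoRA-style initialisation: A is small and non-zero, B starts at zero.
A = [[0.1, -0.2, 0.3]]
B = [[0.0], [0.0], [0.0]]

forget_x, forget_y = [1.0, 0.0, 0.0], 0  # a one-example "Forget Set"

loss_before = loss(W, A, B, forget_x, forget_y)
for _ in range(25):
    A, B = ascent_step(W, A, B, forget_x, forget_y, lr=0.5)
loss_after = loss(W, A, B, forget_x, forget_y)

print(f"forget-set loss before: {loss_before:.4f}, after: {loss_after:.4f}")
```

Only `A` and `B` change, so discarding the adapter restores the original model. Note that nothing bounds the loss from above in this sketch; running more steps keeps pushing it upward.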

2. Selective Unlearning (LIBU/LoKU)

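To prevent the model from losing general capability (Utility Collapse), more advanced methods protect the parameters that matter for retained knowledge:

  • Fisher Information: Estimates how important each parameter is, identifying the ones that must be kept intact.
  • Inverted Hinge Loss: Instead of pushing outputs toward random noise, it lowers the probability of the target answer while raising that of the “second best” safe answer.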
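
Both ideas can be sketched in a few lines of plain Python. This is a generic illustration under stated assumptions, not the exact procedure of LIBU or LoKU: the inverted hinge loss is written here as `1 + p(target) - max p(other)`, the Fisher information is approximated by the mean squared gradient (the diagonal empirical Fisher) of a toy softmax-linear model, and the function names and the threshold are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def inverted_hinge_loss(probs, target):
    """1 + p(target) - max over the other classes.

    Minimising this lowers the target's probability and raises the best
    alternative. Its value always lies between 0 and 2, unlike a negated
    cross-entropy, which is unbounded.
    """
    best_other = max(p for i, p in enumerate(probs) if i != target)
    return 1.0 + probs[target] - best_other

def diagonal_fisher(W, samples):
    """Mean squared gradient of the cross-entropy for each weight W[i][j]."""
    rows, cols = len(W), len(W[0])
    fisher = [[0.0] * cols for _ in range(rows)]
    for x, y in samples:
        logits = [sum(w * xj for w, xj in zip(row, x)) for row in W]
        p = softmax(logits)
        for i in range(rows):
            g_i = p[i] - (1.0 if i == y else 0.0)  # d loss / d logit_i
            for j in range(cols):
                fisher[i][j] += (g_i * x[j]) ** 2
    n = len(samples)
    return [[f / n for f in row] for row in fisher]

def protect_mask(fisher, threshold):
    """True marks a weight considered important enough to leave untouched."""
    return [[f >= threshold for f in row] for row in fisher]

# Loss is high while the target is still the top answer, low once it is not.
print(inverted_hinge_loss([0.7, 0.2, 0.1], target=0))
print(inverted_hinge_loss([0.05, 0.9, 0.05], target=0))

# Importance scores computed on a tiny, made-up "retain set".
W_demo = [[0.0, 0.0], [0.0, 0.0]]
retain_set = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]
scores = diagonal_fisher(W_demo, retain_set)
print(protect_mask(scores, threshold=0.1))
```

In an unlearning run, the mask would typically be used to skip or down-weight updates to the protected weights while the bounded loss is minimised on the Forget Set.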

Use Cases

  • Privacy Compliance: “Right to be Forgotten” (GDPR).
  • Outdated Knowledge: Removing a model’s knowledge of a deprecated software version to prevent it from suggesting old code.
  • Safety: Removing hazardous knowledge (e.g., bioweapons, hacking) that was accidentally scraped during pre-training.

Unlearning vs. Prompting

| Feature | System Prompting | Model Unlearning |
| --- | --- | --- |
| Effort | Low (text edit) | Moderate (LoRA training) |
| Reliability | Low (jailbreakable) | High (mathematically suppressed) |
| Scalability | Low (context window limit) | High (baked into weights) |

Sources

  • [[lora_unlearning_research_2026]] (Research Summary)