Model Unlearning & Editing
Overview
Model Unlearning (also called Machine Unlearning) removes the influence of specific information from a model’s weights after training. Unlike [[Retrieval-Augmented Generation (RAG)]], which supplies extra information through the prompt at inference time, Unlearning changes the “baked-in” knowledge of the model itself.
The LoRA Approach (2026 Standard)
As of 2026, LoRA (Low-Rank Adaptation) is the primary method for unlearning: it trains a small low-rank adapter on top of frozen base weights, which allows surgical updates without the cost of a full retrain.
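To make the cost argument concrete, the sketch below counts trainable parameters for a full weight matrix against a rank-r adapter, and shows how an adapter update B·A is merged into a base weight. The 4096×4096 layer size, rank 8, and the helper names are illustrative assumptions, not values from the source.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters: full matrix vs. a rank-`rank` adapter (B: d_out x r, A: r x d_in)."""
    full = d_out * d_in
    adapter = rank * (d_out + d_in)
    return full, adapter

def merge(W, B, A, scale=1.0):
    """Return W + scale * (B @ A) using plain nested lists; W itself is not modified."""
    rank = len(A)
    return [
        [W[i][j] + scale * sum(B[i][r] * A[r][j] for r in range(rank))
         for j in range(len(W[0]))]
        for i in range(len(W))
    ]

# Hypothetical layer size and rank, chosen only for illustration.
full, adapter = lora_param_counts(4096, 4096, 8)
ratio = adapter / full
print(f"full={full:,} adapter={adapter:,} ratio={ratio:.4%}")

# Tiny rank-1 example: the base weights stay frozen, the adapter carries the edit.
W = [[1.0, 2.0], [3.0, 4.0]]
B = [[1.0], [0.0]]
A = [[0.5, -1.0]]
W_edited = merge(W, B, A)
```

Because the base matrix is never overwritten, dropping the adapter restores the original model, which is part of why adapter-based edits are attractive for unlearning experiments.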
1. Gradient Ascent (GA)
Instead of minimizing the loss to learn a fact, training performs Gradient Ascent to maximize the loss on specific target data.
- Goal: To make the model “blind” to specific facts (e.g., PII, copyrighted data, or outdated project info).
- Tooling: Train a LoRA adapter on a “Forget Set” with the loss negated, so that a standard optimizer’s descent step becomes an ascent step.
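The sign flip described above can be sketched on a toy model. Here the "model" is just three logits for a single context, and token 0 stands in for the fact to forget; this setup, the learning rate, and the step count are all illustrative assumptions. A real run would apply the same update to LoRA adapter weights rather than to raw logits.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nll_and_grad(logits, target):
    """Cross-entropy loss for one target token and its gradient w.r.t. the logits."""
    probs = softmax(logits)
    loss = -math.log(probs[target])
    grad = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return loss, grad

def gradient_ascent_unlearn(logits, target, lr=0.5, steps=20):
    """Maximize the loss on a forget example by stepping along the gradient."""
    logits = list(logits)
    for _ in range(steps):
        _, grad = nll_and_grad(logits, target)
        # Ordinary training would subtract lr * g; unlearning adds it.
        logits = [z + lr * g for z, g in zip(logits, grad)]
    return logits

# Hypothetical toy values: token 0 is the fact to forget.
initial_logits = [3.0, 1.0, 0.5]
forget_target = 0
before = softmax(initial_logits)[forget_target]
unlearned_logits = gradient_ascent_unlearn(initial_logits, forget_target)
after = softmax(unlearned_logits)[forget_target]
print(f"p(target) before={before:.3f} after={after:.3f}")
```

Note that the cross-entropy loss has no upper bound, so nothing in this objective tells the ascent when to stop; in practice the step count or learning rate has to be limited.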
2. Selective Unlearning (LIBU/LoKU)
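Pushed too far, gradient ascent degrades the model’s general ability and makes it “stupid” (Utility Collapse). To prevent this, advanced methods protect the critical parts of the network:
- Fisher Information: Scores how important each weight is to knowledge that must be kept, so that the most important weights can be left intact.
- Inverted Hinge Loss: Instead of pushing outputs toward random noise, it lowers the probability of the forgotten answer while pushing the model toward the “second best” safe answer.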
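Both ideas can be sketched together on a small linear softmax model. The loss below takes the form 1 + p(target) − max p(other), which is one common reading of an inverted hinge loss and should be treated as an assumption rather than the exact formulation used by LIBU or LoKU. The Fisher score is the empirical diagonal version (mean squared gradient on a retain set), and freezing weights above a threshold is just one simple way to "keep them intact". All data, names, and thresholds are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(W, x):
    """Token probabilities of a linear softmax model: softmax(W @ x)."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in W])

def diagonal_fisher(W, retain_set):
    """Empirical diagonal Fisher: mean squared NLL gradient for each weight."""
    fisher = [[0.0] * len(W[0]) for _ in W]
    for x, target in retain_set:
        probs = predict(W, x)
        for k, p in enumerate(probs):
            dlogit = p - (1.0 if k == target else 0.0)
            for i, xi in enumerate(x):
                fisher[k][i] += (dlogit * xi) ** 2 / len(retain_set)
    return fisher

def inverted_hinge_loss(probs, target):
    """1 + p(target) - p(best other token); bounded between 0 and 2."""
    rival = max(p for k, p in enumerate(probs) if k != target)
    return 1.0 + probs[target] - rival

def ihl_step(W, x, target, fisher, lr=0.5, protect_above=1e-3):
    """One descent step on the inverted hinge loss, skipping protected weights."""
    probs = predict(W, x)
    others = [k for k in range(len(probs)) if k != target]
    rival = max(others, key=lambda k: probs[k])
    # Gradient of p[target] - p[rival] w.r.t. each logit (softmax Jacobian).
    dlogits = [
        probs[target] * ((1.0 if j == target else 0.0) - probs[j])
        - probs[rival] * ((1.0 if j == rival else 0.0) - probs[j])
        for j in range(len(probs))
    ]
    return [
        [w if fisher[k][i] > protect_above else w - lr * dlogits[k] * x[i]
         for i, w in enumerate(row)]
        for k, row in enumerate(W)
    ]

# Hypothetical toy data: 3 tokens, 3 input features; feature 2 is shared.
W = [[2.0, 0.0, 0.5], [0.0, 2.0, 0.5], [0.5, 0.5, 0.0]]
forget_x, forget_target = [1.0, 0.0, 1.0], 0
retain_x, retain_target = [0.0, 1.0, 1.0], 1

fisher = diagonal_fisher(W, [(retain_x, retain_target)])
forget_before = predict(W, forget_x)
retain_before = predict(W, retain_x)
for _ in range(50):
    W = ihl_step(W, forget_x, forget_target, fisher)
forget_after = predict(W, forget_x)
retain_after = predict(W, retain_x)
print("forget:", forget_before, "->", forget_after)
print("retain unchanged:", retain_before == retain_after)
```

Unlike the negated cross-entropy, this loss is bounded, so the update fades out once the runner-up token dominates instead of continuing to distort the model.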
Use Cases
- Privacy Compliance: “Right to be Forgotten” (GDPR).
- Outdated Knowledge: Removing a model’s knowledge of a deprecated software version to prevent it from suggesting old code.
- Safety: Removing hazardous knowledge (e.g., bioweapons, hacking) that was accidentally scraped during pre-training.
Unlearning vs. Prompting
| Feature | System Prompting | Model Unlearning |
|---|---|---|
| Effort | Low (Text edit). | Moderate (LoRA training). |
| Reliability | Low (Jailbreakable). | Higher (Suppressed in the weights, though not guaranteed to be unrecoverable). |
| Scalability | Low (Context window limit). | High (Baked into weights). |
Sources
- [[lora_unlearning_research_2026]] (Research Summary)