Model Unlearning & Editing

Overview

Model Unlearning (or Machine Unlearning) is the technical process of removing, or at least suppressing, specific information in a model’s weights after training is complete. Unlike [[Retrieval-Augmented Generation (RAG)]], which adds information to the prompt at inference time, unlearning targets the “baked-in” knowledge of the model itself.

The LoRA Approach (2026 Standard)

In 2026, LoRA (Low-Rank Adaptation) is the primary method for unlearning. It trains a small pair of low-rank adapter matrices while the base weights stay frozen, which allows targeted updates without the cost of a full retrain.
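As a rough illustration of why the adapter is cheap, the sketch below compares trainable parameter counts for a single weight matrix. The 4096 x 4096 layer size and the rank of 8 are hypothetical values chosen for the example, not figures from the source.

```python
def full_params(d_out, d_in):
    """Trainable weights when the whole matrix is updated."""
    return d_out * d_in

def lora_params(d_out, d_in, rank):
    """Trainable weights for a rank-`rank` adapter: B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)

# Hypothetical layer shape and rank, for illustration only.
d_out, d_in, rank = 4096, 4096, 8
full = full_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
print(full, lora, full // lora)  # 16777216 65536 256
```

Under these assumed sizes the adapter trains 256 times fewer weights than a full update of the same layer.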

1. Gradient Ascent (GA)

Instead of minimizing the loss to learn a fact, training runs gradient ascent to maximize the loss on the specific data to be forgotten.

Goal: To make the model “blind” to specific facts (e.g., PII, copyrighted data, or outdated project info).
Tooling: Train a LoRA adapter on a “Forget Set” with the loss negated, so that a standard minimizing optimizer performs ascent.
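The idea above can be sketched on a toy model. The example below is a minimal, pure-Python illustration and not a production recipe: it assumes a frozen 3-class linear layer `W`, a rank-1 adapter `outer(b, a)`, and a single hypothetical forget example. Only the adapter is updated, and the update adds the gradient of the cross-entropy loss (ascent) instead of subtracting it.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward(W, b, a, x):
    """Logits of the frozen layer W plus a rank-1 adapter outer(b, a)."""
    s = sum(aj * xj for aj, xj in zip(a, x))  # a . x
    return [sum(w * xj for w, xj in zip(row, x)) + bi * s
            for row, bi in zip(W, b)]

def unlearn_step(W, b, a, x, y, lr):
    """One gradient-ascent step on cross-entropy; W is never modified."""
    p = softmax(forward(W, b, a, x))
    g = [pi - (1.0 if i == y else 0.0) for i, pi in enumerate(p)]  # dLoss/dlogits
    s = sum(aj * xj for aj, xj in zip(a, x))
    gb = sum(gi * bi for gi, bi in zip(g, b))
    new_b = [bi + lr * gi * s for bi, gi in zip(b, g)]   # "+" means ascent
    new_a = [aj + lr * gb * xj for aj, xj in zip(a, x)]
    return new_b, new_a

# Hypothetical toy setup: the base layer confidently maps input 0 to class 0.
W = [[3.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 3.0]]
forget_x, forget_y = [1.0, 0.0, 0.0], 0
b = [0.0, 0.0, 0.0]   # one factor starts at zero, so the adapter begins as a no-op
a = [0.5, 0.5, 0.5]

p_before = softmax(forward(W, b, a, forget_x))[forget_y]
for _ in range(20):
    b, a = unlearn_step(W, b, a, forget_x, forget_y, lr=0.5)
p_after = softmax(forward(W, b, a, forget_x))[forget_y]
print(f"p(forgotten answer): {p_before:.3f} -> {p_after:.3f}")
```

Because the cross-entropy loss has no upper bound, nothing in this loop tells the ascent when to stop; in practice the number of steps and the learning rate have to be limited, which is one motivation for the selective methods described next.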

2. Selective Unlearning (LIBU/LoKU)

To prevent the model from losing its general capability (Utility Collapse), selective methods protect the neurons that matter most for knowledge that should be retained:

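Fisher Information: Estimates how important each parameter is, which identifies the neurons that must be kept intact.
Inverted Hinge Loss: Rather than degrading the output into noise, it lowers the probability of the forgotten token and shifts it toward the next most likely (“second best”) answer.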
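Both ideas can be sketched in a few lines. The code below is an illustration under stated assumptions, not the reference implementation of any named method: the inverted hinge loss is assumed to take the form `1 + p(target) - max p(other token)`, the Fisher estimate is the empirical diagonal Fisher of a toy linear softmax layer, and `protection_mask` with its threshold is a hypothetical policy for deciding which weights to freeze.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def inverted_hinge_loss(probs, target):
    """Assumed form: 1 + p(target) - max over the other tokens of p(token).
    Minimizing it lowers the target and raises the runner-up; it is bounded in [0, 2]."""
    runner_up = max(p for i, p in enumerate(probs) if i != target)
    return 1.0 + probs[target] - runner_up

def diagonal_fisher(W, data):
    """Mean squared gradient of log p(y|x) for each weight of a linear softmax layer."""
    fisher = [[0.0] * len(W[0]) for _ in W]
    for x, y in data:
        p = softmax([sum(w * xj for w, xj in zip(row, x)) for row in W])
        for i in range(len(W)):
            err = (1.0 if i == y else 0.0) - p[i]  # d log p(y|x) / d logit_i
            for j, xj in enumerate(x):
                fisher[i][j] += (err * xj) ** 2
    return [[v / len(data) for v in row] for row in fisher]

def protection_mask(fisher, threshold):
    """True marks a weight to freeze during unlearning (hypothetical policy)."""
    return [[v > threshold for v in row] for row in fisher]

# The loss is high while the forgotten token dominates, lower once a runner-up takes over.
loss_confident = inverted_hinge_loss([0.7, 0.2, 0.1], target=0)
loss_suppressed = inverted_hinge_loss([0.1, 0.7, 0.2], target=0)

# Toy retain set: only the first input feature is ever active.
W = [[0.0, 0.0], [0.0, 0.0]]
retain_set = [([1.0, 0.0], 0)]
mask = protection_mask(diagonal_fisher(W, retain_set), threshold=0.1)
print(loss_confident, loss_suppressed, mask)
```

In this toy setup the weights attached to the first input feature are flagged for protection because the retain set depends on them, while the weights on the unused second feature remain free to change.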

Use Cases

Privacy Compliance: “Right to be Forgotten” (GDPR).
Outdated Knowledge: Removing a model’s knowledge of a deprecated software version to prevent it from suggesting old code.
Safety: Removing hazardous knowledge (e.g., bioweapons, hacking) that was accidentally scraped during pre-training.

Unlearning vs. Prompting

Feature	System Prompting	Model Unlearning
Effort	Low (Text edit).	Moderate (LoRA training).
Reliability	Low (Jailbreakable).	Higher (Suppressed in the weights).
Scalability	Low (Context window limit).	High (Baked into weights).

Sources

[[lora_unlearning_research_2026]] (Research Summary)