Same Parts, Different Wiring: Mechanistic Interpretability of Moral Fine-Tuning
An exploration of how moral fine-tuning changes LLMs