AI Safety & Mechanistic Interpretability

Research notes and projects on AI safety, evaluations, goal drift, and mechanistic interpretability of model behavior.

An exploration of how moral fine-tuning changes LLMs.

A summary of my takeaways from three influential articles on conducting effective empirical AI alignment research.

My notes and takeaways from the BlueDot AI Safety Evals paper club, covering recent papers on AI alignment, security, and evaluations.