Same Parts, Different Wiring: Mechanistic Interpretability of Moral Fine-Tuning
An exploration of how moral fine-tuning changes LLMs
Deep learning notes and projects covering neural networks, sequence models, language models, generative modeling, and interpretability.
This page covers neural networks, mostly the language side of them. There's writing on search and semantic search, how language models work, a survey of text diffusion, and some interpretability work that looks at what a model has actually learned.
It overlaps a lot with the machine learning page. This is the part that's specifically about deep learning and language models.