Feature learning and "the linear representation hypothesis" for monitoring and steering LLMs
Data Science Seminar
Misha Belkin
UCSD
Abstract
A trained Large Language Model (LLM) contains much of human knowledge. Yet, it is difficult to gauge the extent or accuracy of that knowledge, as LLMs do not always "know what they know" and may even be unintentionally or actively misleading. In this talk I will discuss feature learning and introduce Recursive Feature Machines, a powerful method originally designed for extracting relevant features from tabular data. I will demonstrate how this technique enables us to detect and precisely steer LLM behaviors toward almost any desired concept by manipulating a single fixed vector in the LLM activation space.
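To make the idea of "manipulating a single fixed vector in the LLM activation space" concrete, below is a minimal, illustrative sketch of activation steering: a fixed vector is added to the hidden states at one transformer layer during generation. The model name, layer index, scale, and the steering vector itself are placeholder assumptions for illustration; in the talk, the relevant concept vector would be obtained via feature-learning methods such as Recursive Feature Machines rather than chosen at random.

```python
# Illustrative sketch only: steer an LLM by adding a fixed vector to its
# hidden activations at one layer. The steering vector here is a random
# placeholder; a real concept vector would be extracted from data.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumed small model, chosen only for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer_idx = 6   # assumed intermediate layer
scale = 4.0     # assumed steering strength
hidden_dim = model.config.hidden_size

# Placeholder concept direction, normalized to unit length.
steering_vector = torch.randn(hidden_dim)
steering_vector = steering_vector / steering_vector.norm()

def add_steering_vector(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states
    # of shape (batch, sequence, hidden_dim); broadcast-add the fixed vector.
    hidden = output[0] + scale * steering_vector.to(output[0].dtype)
    return (hidden,) + output[1:]

# Attribute path is GPT-2 specific; other architectures name layers differently.
handle = model.transformer.h[layer_idx].register_forward_hook(add_steering_vector)

prompt = "The weather today is"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=30, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # remove the hook to restore unmodified behavior
```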