Feature learning and "the linear representation hypothesis" for monitoring and steering LLMs

Data Science Seminar

Misha Belkin
UCSD

Abstract

A trained Large Language Model (LLM) contains much of human knowledge. Yet, it is difficult to gauge the extent or accuracy of that knowledge, as LLMs do not always "know what they know" and may even be unintentionally or actively misleading. In this talk I will discuss feature learning, introducing Recursive Feature Machines (RFMs), a powerful method originally designed for extracting relevant features from tabular data. I will demonstrate how this technique enables us to detect and precisely guide LLM behaviors toward almost any desired concept by manipulating a single fixed vector in the LLM activation space.
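To make the last claim concrete, the sketch below illustrates the general idea of steering an LLM by adding a single fixed vector to its hidden activations, which is the setting the linear representation hypothesis refers to. It is not the RFM procedure from the talk: the model name, layer index, steering strength, and the random placeholder vector are all illustrative assumptions.

```python
# Minimal sketch (assumptions noted): steer a causal LM by adding a fixed
# "concept direction" to the hidden states of one transformer block.
# This is NOT the RFM method from the talk; in practice the vector would be
# a learned feature direction rather than a random placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any small causal LM works for this sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer_idx = 6                               # assumption: which block to steer
hidden_dim = model.config.hidden_size
steering_vec = torch.randn(hidden_dim)      # placeholder concept direction
steering_vec = steering_vec / steering_vec.norm()
alpha = 4.0                                 # steering strength (assumption)

def add_steering(module, inputs, output):
    # Forward hook: shift every token's hidden state along the fixed direction.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * steering_vec.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[layer_idx].register_forward_hook(add_steering)

prompt = "The weather today is"
ids = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # detach the hook to restore the unsteered model
```

Monitoring works the same way in reverse: instead of adding the vector, one projects the hidden states onto the concept direction to detect whether the concept is active.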

Start date
Tuesday, March 18, 2025, 1:25 p.m.
End date
Tuesday, March 18, 2025, 2:25 p.m.
Location

Lind Hall 325 or via Zoom
