Exploiting Low-Dimensional Data Structures and Understanding Neural Scaling Laws of Transformers

Data Science Seminar

Wenjing Liao
Georgia Tech

Abstract

When training deep neural networks, a model’s generalization error is often observed to follow a power scaling law in the model size and the data size. Perhaps the best-known example of such scaling laws is for transformer-based large language models (LLMs), where networks with billions of parameters are trained on trillions of tokens of text. A question of theoretical interest is why these transformer scaling laws exist. To answer it, we exploit low-dimensional structures in language datasets by estimating their intrinsic dimension, and we establish statistical estimation and mathematical approximation theories for transformers that predict the scaling laws. By leveraging low-dimensional data structures, we can explain transformer scaling laws in a way that respects the data geometry. Furthermore, we test our theory against empirical observations by training LLMs on language datasets and find strong agreement between the observed empirical scaling laws and our theoretical predictions.
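
As background for the abstract's mention of estimating the intrinsic dimension of a dataset, below is a minimal sketch of one standard estimator, the two-nearest-neighbor (TwoNN) maximum-likelihood estimate, applied to synthetic embeddings. The abstract does not state which estimator the speaker uses; the data, dimensions, and variable names here are illustrative assumptions only.

```python
# Sketch: TwoNN-style intrinsic dimension estimate on synthetic data.
# This is an illustration, not the method from the talk.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Synthetic data: points lying on a 5-dimensional subspace embedded in 128 dimensions.
low_dim = rng.normal(size=(2000, 5))
basis = rng.normal(size=(5, 128))
X = low_dim @ basis

# Distances to the two nearest neighbors (column 0 is the point itself).
nn = NearestNeighbors(n_neighbors=3).fit(X)
dist, _ = nn.kneighbors(X)
r1, r2 = dist[:, 1], dist[:, 2]

# TwoNN maximum-likelihood estimate: d ≈ N / sum(log(r2 / r1)).
mu = np.log(r2 / r1)
d_hat = len(mu) / mu.sum()
print(f"estimated intrinsic dimension ≈ {d_hat:.2f}")  # should be close to 5
```

Estimates of this kind feed into nonparametric theory, where approximation and estimation rates depend on the intrinsic dimension rather than the ambient dimension, which is the sense in which scaling-law predictions can "respect the data geometry."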

Start date
Tuesday, Feb. 25, 2025, 1:25 p.m.
End date
Tuesday, Feb. 25, 2025, 2:25 p.m.
Location
Lind Hall 325 or via Zoom
