UMN Machine Learning Seminar: The Polyak-Lojasiewicz condition as a framework for over-parameterized optimization and its application to deep learning

The UMN Machine Learning Seminar Series brings together faculty, students, and local industrial partners who are interested in the theoretical, computational, and applied aspects of machine learning, to pose problems, exchange ideas, and foster collaborations. Talks are held every Thursday from 12 p.m. to 1 p.m. during the Fall 2021 semester.

This week's speaker, Mikhail Belkin (University of California San Diego), will give a talk titled "The Polyak-Lojasiewicz condition as a framework for over-parameterized optimization and its application to deep learning."

Abstract

The success of deep learning is due, to a large extent, to the remarkable effectiveness of gradient-based optimization methods applied to large neural networks. In this talk I will discuss some general mathematical principles allowing for efficient optimization in over-parameterized non-linear systems, a setting that includes deep neural networks. I will explain that the optimization problems corresponding to these systems are not convex, even locally, but instead satisfy the Polyak-Lojasiewicz (PL) condition on most of the parameter space, allowing for efficient optimization by gradient descent or SGD. I will connect the PL condition of these systems to the condition number associated with the tangent kernel and show how a non-linear theory for those systems parallels classical analyses of over-parameterized linear equations. As a separate related development, I will discuss a perspective on the remarkable, recently discovered phenomenon of transition to linearity (constancy of the NTK) in certain classes of large neural networks. I will show how this transition to linearity results from the scaling of the Hessian with the size of the network, controlled by certain functional norms. Combining these ideas, I will show how the transition to linearity can be used to demonstrate the PL condition and convergence for a general class of wide neural networks. Finally, I will comment on systems that are "almost" over-parameterized, a situation that appears to be common in practice.
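For readers new to the PL condition, the following is a minimal sketch of the classical linear case the abstract alludes to. It is not taken from the talk. It assumes the standard form of the condition, 0.5 * ||grad L(w)||^2 >= mu * (L(w) - L*), and a hypothetical toy system with 2 equations and 3 unknowns, so L* = 0. For the least-squares loss L(w) = 0.5 * ||Aw - y||^2, the matrix A A^T plays the role of the tangent kernel: its smallest eigenvalue gives the PL constant mu, its largest gives the smoothness constant beta, and gradient descent with step size 1/beta shrinks the loss by at least a factor (1 - mu/beta) per step. All names below (`A`, `y`, `MU`, `BETA`, `gradient_descent`) are illustrative.

```python
# Hypothetical toy problem: 2 equations, 3 unknowns (over-parameterized).
A = [[1.0, 0.0, 1.0],
     [0.0, 2.0, 0.0]]
y = [1.0, 2.0]

# For this A, the "tangent kernel" A A^T equals diag(2, 4).
MU = 2.0    # smallest eigenvalue of A A^T: the PL constant
BETA = 4.0  # largest eigenvalue of A A^T: the smoothness constant


def residual(w):
    """Return A w - y."""
    return [sum(a * x for a, x in zip(row, w)) - t for row, t in zip(A, y)]


def loss(w):
    """Least-squares loss L(w) = 0.5 * ||A w - y||^2 (minimum value is 0)."""
    return 0.5 * sum(r * r for r in residual(w))


def grad(w):
    """Gradient A^T (A w - y)."""
    r = residual(w)
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(w))]


def gradient_descent(w, steps):
    """Run gradient descent with step size 1/BETA, recording the loss."""
    history = [loss(w)]
    for _ in range(steps):
        g = grad(w)
        # Check the PL inequality at the current point (small float tolerance).
        assert 0.5 * sum(x * x for x in g) >= MU * history[-1] - 1e-12
        w = [x - g_j / BETA for x, g_j in zip(w, g)]
        history.append(loss(w))
    return w, history


w_final, losses = gradient_descent([0.0, 0.0, 0.0], 20)
```

The loss here is convex but not strongly convex, since A^T A is singular and the solutions form a line, yet the PL inequality still yields a geometric rate. The talk concerns the non-linear analogue of this picture.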

Biography

Mikhail Belkin received his Ph.D. in 2003 from the Department of Mathematics at the University of Chicago. His research interests are in the theory and applications of machine learning and data analysis. His well-known work includes the widely used Laplacian Eigenmaps, Graph Regularization, and Manifold Regularization algorithms, which brought ideas from classical differential geometry and spectral analysis to data science. His recent work has been concerned with understanding remarkable mathematical and statistical phenomena observed in deep learning. This empirical evidence necessitated revisiting some of the basic concepts in statistics and optimization. One of his key recent findings is the "double descent" risk curve, which extends the textbook U-shaped bias-variance trade-off curve beyond the point of interpolation. Mikhail Belkin is a recipient of an NSF CAREER Award and a number of best paper and other awards. He has served on the editorial boards of the Journal of Machine Learning Research, IEEE Transactions on Pattern Analysis and Machine Intelligence, and the SIAM Journal on Mathematics of Data Science.

Start date
Thursday, Sept. 23, 2021, Noon
End date
Thursday, Sept. 23, 2021, 1 p.m.
Location
