IMA Data Science Seminar: Clustering High-dimensional Data with Path Metrics: A Balance of Density and Geometry
Data science seminars hosted by the Institute for Mathematics and Its Applications (IMA) take place on Tuesdays from 1:25 p.m. to 2:25 p.m.
This week, Anna Little (The University of Utah) will be giving a talk titled "Clustering High-dimensional Data with Path Metrics: A Balance of Density and Geometry".
Registration is required to access the Zoom webinar.
This talk discusses multiple methods for clustering high-dimensional data, and explores the delicate balance between utilizing data density and data geometry. I will first present path-based spectral clustering, a novel approach which combines a density-based metric with graph-based clustering. This density-based path metric allows for fast algorithms and strong theoretical guarantees when clusters concentrate around low-dimensional sets. However, the method suffers from a loss of geometric information, which is preserved by simple linear dimension reduction methods such as classic multidimensional scaling (CMDS). The second part of the talk will explore when CMDS followed by a simple clustering algorithm can exactly recover all cluster labels with high probability. However, the scaling conditions become increasingly restrictive as the ambient dimension increases, and the method will fail for irregularly shaped clusters. Finally, I will discuss how a more general family of path metrics, combined with MDS, gives low-dimensional embeddings which respect both data density and data geometry. This new method exhibits promising performance on single-cell RNA sequence data and can be computed efficiently by restriction to a sparse graph.
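
For readers unfamiliar with the ingredients named in the abstract, the following is a minimal sketch of what "a path metric combined with MDS" could look like. It is not the speaker's implementation. It assumes one common form of path metric, the power-weighted shortest-path distance, in which each edge costs the Euclidean distance raised to a power p and the total path cost is raised to 1/p. The function names `power_path_distances` and `classical_mds`, the parameter `p`, and the toy data are all hypothetical. For simplicity the sketch works on the full graph with a dense all-pairs update rather than the sparse graph mentioned in the abstract.

```python
import numpy as np


def power_path_distances(X, p=2.0):
    """Assumed path metric: min over paths of (sum of edge_length**p) ** (1/p).

    Larger p penalizes long hops more, so paths prefer to pass through
    densely sampled regions. With p = 1 this reduces to Euclidean distance.
    """
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))  # pairwise Euclidean distances
    C = D ** p                             # edge costs on the full graph
    # Floyd-Warshall all-pairs shortest paths, vectorized over (i, j).
    for k in range(C.shape[0]):
        C = np.minimum(C, C[:, k:k + 1] + C[k:k + 1, :])
    return C ** (1.0 / p)


def classical_mds(D, dim=2):
    """Classical MDS: double-center squared distances, keep top eigenvectors."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n    # centering matrix
    B = -0.5 * J @ (D ** 2) @ J            # Gram-like matrix
    vals, vecs = np.linalg.eigh(B)         # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:dim]     # indices of the largest ones
    vals = np.clip(vals[idx], 0.0, None)   # discard negative eigenvalues
    return vecs[:, idx] * np.sqrt(vals)


# Toy data (hypothetical): two Gaussian blobs in 5 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(30, 5)),
               rng.normal(2.0, 0.3, size=(30, 5))])
embedding = classical_mds(power_path_distances(X, p=2.0), dim=2)
```

A simple clustering algorithm could then be run on `embedding`. The dense update above costs cubic time in the number of points, which is why restricting to a sparse graph matters in practice.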