Beyond Sub-Gaussian and Independent Data in High Dimensional Regression [thesis]

Author

Vidyashankar Sivakumar (Ph.D. 2020)

Abstract

The past three decades have seen major developments in high-dimensional regression models, leading to their successful use in applications across multiple domains, including climate science, finance, recommendation systems, computational biology, and signal processing. The underlying assumption in high-dimensional regression models is that the phenomenon under study can be explained by a simple model with few variables. In high-dimensional parametric regression, where the parameter lives in a high-dimensional space, this simplicity assumption is encoded as a sparsity constraint on the parameter vector. Statistical analysis of high-dimensional regression models studies the properties of these models, including how faithfully they recover the assumed true sparse parameter and how sensitive they are to different data assumptions. While major progress has been made over the past several years, non-asymptotic statistical analysis of high-dimensional regression models still relies on standard data assumptions of (sub-)Gaussianity and independence, which do not hold in some practical applications. For example, datasets in climate and finance are known to have variables with heavier tails than Gaussian, and bandit algorithms collect data sequentially, violating the independence assumption. The topic of this thesis is the non-asymptotic statistical analysis of high-dimensional regression estimators under non-standard data assumptions, covering both the analysis of traditional estimators such as regularized least squares and the design of new algorithms that improve estimation performance. Our technical results highlight geometric properties of high-dimensional models, and all results are therefore expressed in terms of geometric quantities associated with the sparsity structure assumed for the parameter. Much of the analysis borrows tools and techniques from random matrix theory and from probability, such as generic chaining and, more generally, results on the behavior of random variables and vectors in high-dimensional space. We analyze four problems (minimal illustrative code sketches for each follow the list):

1. Regularized least squares with sub-exponential data: Data in domains such as finance and climate science are known to be sub-exponential, i.e., to have probability distributions with tails heavier than Gaussian but dominated by a suitably scaled centered exponential distribution. We study the non-asymptotic estimation performance of the regularized least squares estimator with i.i.d. sub-exponential data, showing that the estimation performance is slightly worse than in the i.i.d. sub-Gaussian setting.

2. High-dimensional quantile regression: We study the quantile regression problem in high dimensions, which models the conditional quantile of a response given covariates. While least squares regression is ideal for modeling the conditional mean of a symmetric (sub-)Gaussian response variable, in many applications it is of interest to model conditional quantiles of the response given covariates in order to fully understand the behavior of the conditional response; e.g., in meteorology there is more interest in modeling the extremes of climate variables like precipitation than in modeling the mean. We show that the sample complexity of parameter recovery for quantile regression is the same as for the least squares loss, and that the estimation is robust to heavy tails and outliers in the response, conforming with traditional wisdom.

3. Unified analysis of robust high-dimensional semi-parametric single index models: High-dimensional semi-parametric single index models are a tradeoff between linear parametric models, which are too restrictive in many applications, and traditional non-parametric regression, which cannot be used in high dimensions due to the curse of dimensionality. We unify the analysis of multiple existing estimators for high-dimensional single index models, highlight their strengths and weaknesses, and propose new estimators that simultaneously overcome the highlighted drawbacks of previously proposed estimators. We also study the non-asymptotic estimation performance of high-dimensional quantile single index models.

4. Smoothed analysis of high-dimensional structured stochastic linear bandits: We analyze the performance of the greedy algorithm under a "smoothed" framework for the stochastic linear bandit problem, assuming the parameter has sparsity-inducing structure. In the smoothed setting, adversarial contexts are perturbed by independent Gaussian noise. Due to the sequential nature of the algorithm, novel analysis and results are required, since the rows of the design matrix are not independent and can have arbitrary correlations. We show that there is implicit exploration in the smoothed setting, so that a simple greedy algorithm succeeds, precluding the need for explicit exploration strategies in high-dimensional linear bandit problems.
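For problem 1, the following is a minimal sketch (not the thesis code) of l1-regularized least squares on an i.i.d. sub-exponential design; the Laplace-distributed covariates, problem sizes, and the lambda ~ sqrt(log p / n) penalty scaling are illustrative assumptions.

```python
# Sketch: lasso on a sub-exponential (Laplace) design.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, s = 200, 1000, 10           # samples, ambient dimension, sparsity

theta_star = np.zeros(p)
theta_star[:s] = 1.0              # assumed true sparse parameter

X = rng.laplace(size=(n, p))      # Laplace rows: a sub-exponential design
y = X @ theta_star + rng.standard_normal(n)

# lambda ~ sqrt(log p / n) is the usual scaling for the l1 penalty
lam = np.sqrt(np.log(p) / n)
theta_hat = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(X, y).coef_

print("l2 estimation error:", np.linalg.norm(theta_hat - theta_star))
```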
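For problem 2, here is a minimal sketch of l1-penalized quantile regression (pinball loss plus l1 penalty), written here with scikit-learn's QuantileRegressor; the heavy-tailed Student-t noise and the 0.9 quantile level are illustrative assumptions.

```python
# Sketch: high-dimensional quantile regression with heavy-tailed noise.
import numpy as np
from sklearn.linear_model import QuantileRegressor

rng = np.random.default_rng(0)
n, p, s, tau = 200, 500, 5, 0.9

theta_star = np.zeros(p)
theta_star[:s] = 1.0

X = rng.standard_normal((n, p))
y = X @ theta_star + rng.standard_t(df=2, size=n)   # heavy-tailed noise

# Pinball loss depends on residuals only through their sign and the
# quantile level, which is the source of robustness to outliers in y.
lam = np.sqrt(np.log(p) / n)
model = QuantileRegressor(quantile=tau, alpha=lam,
                          fit_intercept=True,        # absorbs the noise quantile
                          solver="highs").fit(X, y)

print("l2 estimation error:", np.linalg.norm(model.coef_ - theta_star))
```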
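For problem 3, the sketch below illustrates one well-known two-step estimator family for single index models y = f(<x, theta*>) + noise, not the thesis's unified estimators: under a Gaussian design, an l1-regularized least squares fit recovers the index direction up to scale despite the unknown link, after which a monotone link is fit non-parametrically. The tanh link and isotonic second stage are illustrative assumptions.

```python
# Sketch: two-step estimation of a sparse single index model.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
n, p, s = 300, 500, 5

theta_star = np.zeros(p)
theta_star[:s] = 1.0 / np.sqrt(s)                   # unit-norm sparse index

X = rng.standard_normal((n, p))                     # Gaussian design
y = np.tanh(X @ theta_star) + 0.1 * rng.standard_normal(n)  # unknown link f

# Step 1: recover the index direction with the lasso.
lam = np.sqrt(np.log(p) / n)
direction = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(X, y).coef_
direction /= np.linalg.norm(direction)

# Step 2: fit the monotone link along the estimated index.
z = X @ direction
link = IsotonicRegression(out_of_bounds="clip").fit(z, y)

print("direction error:", np.linalg.norm(direction - theta_star))
```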
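For problem 4, this is a minimal sketch of the greedy strategy in a smoothed sparse linear bandit: fixed stand-in "adversarial" arm contexts are perturbed by fresh Gaussian noise each round, and the algorithm plays the arm that looks best under a running lasso estimate, with no explicit exploration. The horizon, number of arms, noise scales, and refit schedule are illustrative assumptions.

```python
# Sketch: greedy play in a smoothed sparse linear bandit.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
T, k, p, s = 500, 20, 100, 5       # horizon, arms, dimension, sparsity

theta_star = np.zeros(p)
theta_star[:s] = 1.0

base_contexts = rng.standard_normal((k, p))   # stand-in adversarial contexts
X_hist, y_hist = [], []
theta_hat = np.zeros(p)

for t in range(T):
    contexts = base_contexts + rng.standard_normal((k, p))  # smoothing noise
    arm = int(np.argmax(contexts @ theta_hat))              # purely greedy
    x = contexts[arm]
    reward = x @ theta_star + 0.1 * rng.standard_normal()
    X_hist.append(x)
    y_hist.append(reward)
    # Rows of the design are chosen sequentially, hence not independent;
    # the smoothing noise is what supplies the implicit exploration.
    if t >= 10 and t % 10 == 0:
        lam = np.sqrt(np.log(p) / (t + 1))
        theta_hat = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(
            np.array(X_hist), np.array(y_hist)).coef_

print("final l2 error:", np.linalg.norm(theta_hat - theta_star))
```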

Link to full paper

Beyond Sub-Gaussian and Independent Data in High Dimensional Regression

Keywords

data mining, high dimensional regression
