Graduate faculty research


Dr. Sara Algeri

The datasets generated from large astronomical surveys and ambitious experiments in physics have recently revealed the fundamental importance of statistics to:

  1. Conduct reliable and reproducible analyses.
  2. Handle a large amount of data “with care”, i.e., minimizing the risk of false discoveries while maximizing the power of the detection tools adopted.

As a result, astrostatistics and, more broadly, astrophysical data science play a fundamental role in the discovery of new phenomena. As an astrostatistician with a strong interest in statistical methodology, Dr. Sara Algeri’s work aims to develop generalizable statistical solutions that directly address fundamental questions in the physical sciences and, at the same time, can be easily applied to any other scientific problem following a similar statistical paradigm. In line with this, motivated by problems arising in high energy physics and astronomy, her current research focuses on statistical inference for signal detection, background estimation, distributed learning, and uncertainty quantification.


When conducting searches for new astrophysical phenomena, mismodeling of the background distribution can dramatically compromise the sensitivity of the experiment. This figure shows a background calibration for a simulated observation by the Fermi Large Area Telescope. A functional parametric statistical model is used to account for data fluctuations due to the instrumental noise of the detector.

Dr. Jie Ding

Streaming data of massive scale and heterogeneous nature are emerging in statistical and artificial intelligence practice, e.g., recordings from distributed sensor networks, transactions from e-commerce platforms, and media from mobile devices. These data often need to be analyzed in real time because of limits on decision time, hardware capacity, and communication bandwidth. Application domains include real-time Cardiac Organoid Maturation, Human-Robot Teaming, and Threat Detection. Dr. Jie Ding’s recent research aims to address the following challenges:

  1. The underlying data patterns often vary dynamically with time, so model-based time series analysis may require frequent remodeling and back-testing. How can one efficiently strike the right tradeoff between overfitting and underfitting?
  2. Real-world data are often heterogeneous in quality, modality, and even format, requiring appropriate Information Fusion. How can novel frameworks of collaborative learning be developed to scale and robustify single-agent learning capabilities?
  3. In the context of streaming data and collaborative learning, privacy is an inevitable concern from the perspectives of both data providers and service providers. How can the privacy-utility tradeoffs be evaluated and optimized?

Dr. Ding’s research in Assisted Learning aims to significantly enhance the learning ability of decentralized organizations by developing communication protocols, without sharing data, algorithms, or tasks, to secure proprietary information.

Dr. Charles Doss

Dr. Doss’ research focuses on foundations of statistics and data science.

Nonparametric regression and density estimation. In many contexts, especially with complex datasets, it is inappropriate or difficult to specify an overly simplistic parametric model. It is preferred to use so-called nonparametric techniques that are extremely flexible, can learn many and varied function shapes, and “let the data speak for themselves”. Dr. Doss works on studying such flexible nonparametric procedures and their properties.
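
As a small illustration of what "letting the data speak" can look like, the sketch below computes a Gaussian kernel density estimate, a generic textbook nonparametric technique. It is not a description of Dr. Doss's own methods; the function name `gaussian_kde`, the toy sample, and the bandwidth value are all illustrative assumptions.

```python
import math


def gaussian_kde(sample, bandwidth):
    """Return a function estimating the density of `sample`.

    The estimate is an average of Gaussian bumps, one centered at each
    observation. No parametric family is assumed for the data; only the
    bandwidth (a smoothing parameter, chosen by hand here) must be set.
    """
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )

    return density


# Hypothetical bimodal sample: a single Gaussian model would fit it poorly,
# while the kernel estimate places mass near both clusters.
sample = [-2.1, -1.9, -2.0, 1.8, 2.2, 2.0]
f_hat = gaussian_kde(sample, bandwidth=0.5)
for x in (-2.0, 0.0, 2.0):
    print(f"estimated density at {x:+.1f}: {f_hat(x):.4f}")
```

The bandwidth controls the flexibility of the fit: smaller values follow the data more closely, while larger values smooth more aggressively.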

Statistical inference, not just estimation. One of the fundamental requirements in science is to be able not just to provide estimates, but to provide uncertainty quantification, which we can do in terms of hypothesis tests or confidence intervals. Dr. Doss has developed confidence intervals/tests in nonparametric problems in which it is often challenging to conduct such tests.

Causal inference. “Correlation is not causation” is a commonly used phrase, but what is causation and how do we measure it? Causal inference combines statistical tools with a philosophical framework for what could have happened in an experiment under different possible “treatments” or interventions that did not actually happen. This is crucially important for observational studies, where correlations can be misleading and suggest incorrect relationships. Dr. Doss has worked on problems in causal inference, especially when the treatment/intervention is continuous.

Dr. Qian Qin

In statistics and many fields of science, one often needs to sample from an intractable probability distribution, e.g., a posterior distribution from a Bayesian model. Markov chain Monte Carlo (MCMC) is an extremely popular class of algorithms for this type of job. An MCMC algorithm simulates a Markov chain that converges to the desired distribution. The elements of the Markov chain are then used as an approximate sample from the limiting distribution. To ensure that the algorithm yields reliable results, it is important to understand how fast the underlying Markov chain converges. Dr. Qian Qin’s research focuses on the theoretical convergence analysis of MCMC algorithms. He is particularly interested in the convergence properties of MCMC algorithms that arise in Bayesian models associated with large and/or high-dimensional datasets.
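
The idea of simulating a Markov chain whose limiting distribution is the target can be sketched with a random-walk Metropolis sampler, one of the simplest MCMC algorithms. This is a minimal illustration under stated assumptions, not an algorithm taken from Dr. Qin's work: the standard normal target, the function names, the step size, and the burn-in length are all hypothetical choices.

```python
import math
import random


def log_target(x):
    """Unnormalized log-density of the target (standard normal, for illustration)."""
    return -0.5 * x * x


def random_walk_metropolis(log_density, n_steps, step_size=1.0, x0=0.0, seed=0):
    """Simulate a Markov chain whose limiting distribution is exp(log_density)."""
    rng = random.Random(seed)
    x = x0
    log_p = log_density(x)
    chain = []
    for _ in range(n_steps):
        # Propose a move by adding symmetric Gaussian noise to the current state.
        proposal = x + rng.gauss(0.0, step_size)
        log_p_new = log_density(proposal)
        # Accept with probability min(1, p(proposal) / p(x)); otherwise stay put.
        accept_prob = math.exp(min(0.0, log_p_new - log_p))
        if rng.random() < accept_prob:
            x, log_p = proposal, log_p_new
        chain.append(x)
    return chain


chain = random_walk_metropolis(log_target, n_steps=20000, seed=1)
kept = chain[1000:]  # discard an initial burn-in segment (length chosen by hand)
mean = sum(kept) / len(kept)
var = sum((v - mean) ** 2 for v in kept) / len(kept)
print(f"sample mean = {mean:.3f}, sample variance = {var:.3f}")
```

How many steps are needed before the chain's elements can be trusted as an approximate sample is exactly the convergence question described above; the burn-in length here is an arbitrary guess rather than a theoretically justified value.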

Dr. Xiaotong T. Shen

Dr. Shen’s primary research interest is machine learning and data science, with applications in biomedical sciences and engineering. Currently, his group’s active research projects include:

  1. Causal discovery and inference. Causal relations, defined by the local Markov dependence, are fundamental to describing the consequences of actions, beyond mere associations, in science and medical research. For example, in gene network analysis, regulatory gene-to-gene relations are investigated to unravel the genetic underpinnings of disease, where latent confounders such as race and family relatedness could introduce spurious or missed associations in gene expression levels. The research question is how to identify and infer causal relations in the presence of confounders, nonlinearity, and interventions.

  2. Numerical embeddings, language modeling, and generative models. Sentence generation creates representative examples to interpret a learning model as in regression and classification. For example, representative sentences of a topic of interest describe the topic specifically for sentence categorization. The research question focuses on the generation of a description of the underlying learning task to bridge the gap between structured and unstructured data.

  3. Inference for a black-box learner. Explainable artificial intelligence demands interpretability and understanding of features of interest in addition to predictive accuracy. This is especially critical for deep neural networks. The research focus is on hypothesis testing for feature relevance to prediction.


Dr. Joseph Konstan

Joseph A. Konstan conducts research on human-computer interaction and social computing systems with a particular focus on recommender systems. Much of his research involves exploring how to design recommender systems and algorithms that use available data about user behaviors and preferences to make recommendations that go beyond simply predicting what users will like to achieving other goals such as broadening user experiences, increasing long-term value, or minimizing wasted user time. Most of Prof. Konstan's work involves experimentation with actual system users, generally online.

Dr. Shashi Shekhar

Billions of people around the globe use various applications of spatial computing daily—by using a ride-sharing app, GPS, the e911 system, social media check-ins, even Pokémon Go. Scientists and researchers use spatial computing to track diseases, map the bottom of the oceans, chart the behavior of endangered species, and create election maps in real time. Drones and driverless cars use a variety of spatial computing technologies.

Spatial computing works by understanding the physical world, knowing and communicating our relation to places in that world, and navigating through those places. It has changed our lives and infrastructures profoundly, marking a significant shift in how we make our way in the world. Even more compelling opportunities lie ahead. Dr. Shekhar’s research investigates the technologies and ideas behind current and future spatial computing technologies. Examples include GPS and location-based services, including the use of Wi-Fi, Bluetooth, and RFID for position determination out of satellite range; remote sensing and Geo-AI, which uses satellite and aerial platforms to monitor such varied phenomena as global food production, the effects of climate change, and subsurface natural resources on other planets; geographic information systems (GIS), which store, analyze, and visualize spatial data; spatial databases, which store multiple forms of spatial data; and spatial statistics and spatial data science, used to analyze location-related data.

[1] Spatial Computing, S. Shekhar and P. Vold, MIT Press Essential Knowledge Series, 2020.
[2] Spatial Computing, S. Shekhar, S. Feiner and W. Aref, Communications of the ACM, 59(1):72-81, January 2016.

Dr. Ju Sun

Dr. Ju Sun’s group builds foundations and tools for making sense of data. The group’s recent efforts are focused on deep learning, which fuels the ongoing artificial intelligence revolution. They create robust deep learning techniques to enable reliable image recognition, develop faster numerical methods for performing learning with massive datasets, and revamp deep learning to tackle major unsolved scientific and engineering problems. The group applies these novel techniques and tools to unravel the mystery of high energy particles, depict the interior structures of physical and biological samples, and empower smart scooters that can travel safely with the assistance of a cheap onboard camera. Dr. Ju Sun’s group is especially fascinated by the prospect of transforming healthcare and medicine using artificial intelligence and data science. They have been working closely with medical researchers to tame brain tumors, fight COVID-19, and improve trauma and critical care.


Dr. Nathaniel Helwig

Dr. Helwig holds a joint appointment in Psychology and Statistics, with research that is broadly situated at the intersection of multiple fields but connected through a common theme. The students in Dr. Helwig's lab focus on the development of statistical learning methodology and open-source software for analyzing various types of multivariate and functional data collected in the psychological sciences. Dr. Helwig works to promote the use of open-source and nonparametric methods and has interests in both refining and advancing the theory, computation, and application of statistical methods within the fields of psychology and neuroscience.

Recent applied projects:

  • Neural correlates of psychological disorders
  • Early diagnosis of autism spectrum disorders
  • Perceptions of dynamic facial expressions
  • Distinguishing abnormal locomotion patterns
  • Threat generalization in soldiers with PTSD

Recent computational and theoretical projects:

  • Nonparametric inference for nonparametric regression
  • Robust tuning methods for tensor product smoothers
  • Cross-validation and model selection in regression
  • Robust randomization tests for multivariate data
  • Constrained multivariate least squares problems

Dr. Dongyeop Kang

Dr. Kang is passionate about developing human-centered language technologies. The goal of his research is to develop interdisciplinary methods for Natural Language Processing (NLP) models and to build interactive NLP systems for scientists and creative writers. The Minnesota NLP team, an interdisciplinary team, draws from computational linguistics and cognitive sciences, develops state-of-the-art machine learning algorithms, and validates their robustness and practicality to support human-computer interaction.

The Minnesota NLP group has developed a data- and human-centric ML annotation system for creating a more robust and dynamic benchmark set. Data were calibrated for their informativeness using measurements of model uncertainty, variability, and other training dynamics, and were annotated jointly by the model and human annotators. With users and data dynamics in the loop, the group aims to develop a more human-centric machine learning pipeline.

Dr. Sisi Ma

Dr. Sisi Ma’s primary research interest is the application of statistical modeling, machine learning, and causal analysis methods in the fields of biology and medicine. The questions she seeks to answer include how to leverage big data and analytical approaches to:

  1. Diagnose and prognose diseases and disorders earlier and more accurately.
  2. Systematically and efficiently identify potential treatment targets for a given disease.
  3. Identify the best treatment for a particular patient.

She also works on theoretical aspects of predictive modeling and causal modeling.

Dr. J. Ilja Siepmann

Research in Dr. Ilja Siepmann’s laboratory focuses on the development of Monte Carlo algorithms, molecular mechanics force fields, high-performance computing software and workflows, and machine learning approaches to understand complex chemical systems, and to design systems and processes for chemical separations. The Siepmann group collaborates extensively with experimentalists.

Current research areas include:

  1. Nanoporous materials for energy-efficient chemical separations and adsorption cooling;
  2. Self-assembly of asymmetric shape-filling block oligomers; and
  3. Multi-phase flow in aqueous systems.

Within the data science domain, research projects focus on:

  1. Machine learning models that convert discrete adsorption data at specific state points into continuous adsorption surfaces for multi-component mixtures that are not well described by analytical adsorption equations.
  2. Machine learning approaches to predict single- and multi-component adsorption and the spatial distribution of adsorbate molecules from knowledge of only the atomic positions of the nanoporous material.
  3. Vision-based machine learning models for classification of liquid-crystalline or locally ordered phases of complex block oligomers.

Dr. Julian Wolfson

Humankind's ability to collect data is far outpacing its ability to make sense of it. Dr. Wolfson's research focuses on developing tools for quantifying relationships, identifying patterns, and making predictions using complex biomedical data sources. Often, this involves "remixing" modern statistical and computational techniques (e.g., causal inference, dimension reduction, supervised and unsupervised machine learning) to create methods that better account for the key features of real-world data. He has developed novel techniques for a wide range of biomedical data science problems, including:

  + Predicting the risk of cardiovascular disease using electronic health records;

  + Identifying and characterizing human activity patterns based on smartphone sensor data;

  + Aggregating data from multiple sources to make individual-level inferences and predictions;

  + Characterizing how individuals respond differently to treatment; and

  + Evaluating the fairness of clinical risk prediction models.

Dr. Wolfson is engaged in a wide range of collaborative biomedical data science projects with researchers in disciplines including nutrition, pediatrics, cardiology, infectious disease, and emergency medicine. He also serves as co-lead (w/ Dr. Rui Zhang) of the Innovative Methods and Data Science unit within the Center for Learning Health System Sciences, focusing on the creation of new approaches to analyzing large-scale electronic health data.

Dr. Rui Zhang

Dr. Zhang’s research focuses on the development of novel natural language processing (NLP) methods to analyze biomedical big data, including published biomedical literature, electronic health records (EHRs), and patient-generated data from millions of patients. In particular:

  1. The secondary analysis of EHR data for patient care.
  2. Pharmacovigilance knowledge discovery through mining biomedical literature.
  3. Creation of knowledge bases through database integration, terminology, and ontology.

Current projects in Dr. Zhang’s lab include:

  1. Developing NLP methods and applications to extract information from clinical reports.
  2. Mining biomedical literature to discover novel drug-supplement interactions through genetic pathways.
  3. Repurposing existing drugs for COVID-19 treatment through link predictions and literature-based discovery.
  4. Developing computational methods to predict cardiotoxicity caused by personalized cancer treatment, using EHR data.
  5. Developing a conversational agent for consumers based on the developed knowledge base.