Spring 2023 Data Science Poster Fair - Poster Details

Every year, data science M.S. students present their capstone projects during this event as a part of their degree requirements. 

The poster fair is open to the public, and all interested undergraduate and graduate students, alumni, staff, faculty, and industry professionals are encouraged to attend.

Session 1 Presenters: 10-11am


John Carruth

FedTopK: Secure Top-K Query Answering on Private Data Federation

Advisor: Chang Ge, Department of Computer Science & Engineering

Abstract: A Top-K query in relational databases returns the most important K tuples ranked by a certain scoring function. As organizations are generating enormous amounts of data, Top-K query answering often serves as an invaluable prerequisite step in data preparation, exploration, and many other data analysis tasks. To answer a Top-K query, a score is computed for each tuple based on the scoring function, which often aggregates over multiple attribute values, and then scores from different tuples are compared, such that tuples with top scores can be identified. In a private data federation, multiple parties own the partitions of the sensitive data with different access control, and hence efficiently answering the Top-K queries on the overall data while protecting data privacy becomes difficult.

In this work, we propose FedTopK, a uniform framework to enable secure Top-K query answering on both horizontal and vertical private data federations. In FedTopK, we formulate distributed Top-K query answering and design cryptographic protocols to securely and efficiently compute and rank scores over sensitive tuples across different parties. The query performance of FedTopK will be compared and evaluated against non-secure distributed Top-K processing and general-purpose, multi-party computation frameworks. To the best of our knowledge, our work is the first one to tackle this problem.
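As a point of reference for the description above, the following is a minimal sketch of plain, non-secure Top-K answering on a single party's data: each tuple is scored by aggregating attribute values, and the K highest-scoring tuples are returned. It is not the FedTopK protocol and involves no cryptography; the attribute names `a` and `b`, the weights, and the sample rows are all hypothetical.

```python
import heapq


def top_k(rows, score, k):
    """Return the k rows with the highest score, best first (non-secure)."""
    return heapq.nlargest(k, rows, key=score)


def example_score(row):
    # Hypothetical scoring function that aggregates two attribute values.
    return row["a"] + 2 * row["b"]


# Hypothetical tuples; in a data federation these would be split across parties.
ROWS = [
    {"id": 1, "a": 3, "b": 5},
    {"id": 2, "a": 9, "b": 1},
    {"id": 3, "a": 4, "b": 4},
    {"id": 4, "a": 1, "b": 2},
]

top_two_ids = [row["id"] for row in top_k(ROWS, example_score, 2)]
```

In a federated setting the scores could not be compared in the clear like this, which is the difficulty the abstract describes.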

Mohammed Guiga

Deep Learning Models for Measuring Mask-Wearing Behavior in Public Spaces

Advisor: Julian Wolfson, School of Public Health

Abstract: Advances in deep learning have brought the technology to a point of maturity where many pre-trained models exist for common tasks, such as object detection. As a result, smaller companies and industries that may have previously lacked the resources to invest in a machine learning department can now leverage this technology for their own benefit. To explore this further, this paper examines how the Department of Forest Resources, which has traditionally been far removed from software engineering and machine learning, could use this technology to improve its operations. With the maturation of deep learning techniques, the department may now be able to leverage pre-existing models for tasks such as object detection and classification, which can have applications in forest conservation and management, as well as aiding public policy decision makers. The paper also explores the potential challenges and benefits of this approach. By leveraging pre-existing models, the Department of Forest Resources could gain a competitive edge while avoiding the significant investment of time and resources required to develop a machine learning department. The goal of this paper is to demonstrate the feasibility and potential benefits of leveraging deep learning models for practical applications in industries that previously lacked the resources to do so.

Mark Jokinen

Identifying At-Risk Students and Analyzing Achievement Decline with Causal Analysis

Advisor: Erich Kummerfeld, Institute for Health Informatics

Abstract: Schools often seek to improve student outcomes by identifying at-risk students, as research suggests that timely interventions mitigate negative outcomes more effectively than interventions applied only after a student’s performance has declined. Unfortunately, while schools have resources in place to assist at-risk students, they often struggle to identify those students in a timely manner. In this project, we analyze a database of student records to identify previously high-achieving students who suffered a decline in academic achievement. Using statistical modeling and causal analysis, we isolate significant traits and use them to propose a warning system for identifying students at risk of achievement decline.

Navanshu Khare

Synergistic effects of PI and NPI for AD/ADRD

Advisor: Rui Zhang, Division of Computational Health Sciences, Medical School

Abstract: This capstone project aims to investigate the potential synergistic effects of pharmacological interventions (PI) and non-pharmacological interventions (NPI) in improving the lives of individuals affected by Alzheimer's Disease (AD) and Alzheimer's Disease and Related Dementias (ADRD). The project leverages Natural Language Processing (NLP) techniques, with a focus on Named Entity Recognition (NER), to extract events related to complementary and integrative health (CIH) therapies from clinical notes. The data used in this project is obtained from a cohort of individuals with AD.

The objective is to identify the usage of CIH therapies among AD patients and the symptoms for which they are using them. The project involves designing and conducting studies to collect data from clinical notes, which are then analyzed using NER techniques and statistical methods. The findings are presented in a clear and concise manner, with the ultimate goal of identifying the most effective combination of PI and NPI therapies to improve the lives of those affected by AD/ADRD.


The results of this research have significant implications for healthcare providers, policymakers, and the general public, as they could inform the development of novel interventions and the optimization of existing therapies to help mitigate the effects of AD/ADRD. By analyzing the extracted data, the project aims to gain insights into the prevalence and effectiveness of CIH therapies in managing the symptoms of AD, as well as any potential synergies between CIH and conventional therapies.


The project has the potential to inform the development of more effective treatment strategies for AD and improve our understanding of the role of CIH in the management of this condition. Ultimately, by uncovering the synergies between PI and NPI therapies, this research has the potential to significantly improve the lives of individuals affected by AD/ADRD and their families.

Anisha Khetan

Modeling SRTR liver data using machine learning methods

Advisor: Sisi Ma, Institute for Health Informatics

Abstract: Liver transplant is a major procedure and is associated with a substantial burden on patients and society at large. However, the contributing factors to liver transplant success are not well understood. We aim to understand the contributing factors and also derive personalized medicine models for survival after a liver transplant. We analyze the SRTR dataset (which supports the ongoing evaluation of solid organ transplantation in the United States), focusing primarily on its liver transplantation data, which contains 250,751 records. We derived a dataset containing 50 years of liver transplant data with 208,663 unique patients (currently above 18 years of age) having at most one transplant and a previous history of either kidney or liver disease. We consolidated the diagnosis entries (which could be misspelled or contain multiple diagnoses) into 30 major categories using NLP techniques. To characterize the liver transplant populations, we examined clinical characteristics of 4 subgroups of patients: patients who have received a liver transplant and survived, patients who have received a liver transplant and died within X years of the transplant, patients who have been delisted from the waiting list, and patients who are waiting for a transplant. Further, we have built a predictive model for liver transplant survival for the population of patients that received a liver transplant.

Shashank Magdi

Causal and Predictive Analysis of Student Transitions & Performances on Taxing High School Courses

Advisor: Erich Kummerfeld, Institute for Health Informatics

Abstract: Hopkins Public School District has shared data with UMN, consisting of student-level school data from 17 graduating cohorts over their entire K-12 careers. The goal was to leverage the shared data to identify disparities in health and education outcomes contingent on measures like attendance, student behaviors, classes chosen, and so on. This research study aims to investigate a specific subpopulation and propose explicit interventions. The identified subpopulation consists of high school students in Grades 10 and 11. Navigating the middle phase of high school can be arduous for students, due to a possible increase in course difficulty and the availability of a plethora of options. The objectives of the research study are: 1) to identify challenging courses for students in Grade 11, 2) to accurately predict Grade 10 student performance in the challenging courses, 3) to identify students who would benefit from opting for the challenging courses versus those who would be better suited to less challenging courses, and 4) to pin down factors contributing to performance and thereby suggest ways to improve the overall academic competency of students. The first objective involves prudent data preparation, followed by answering cross-domain questions using exploratory analysis. The second and third objectives entail the use of predictive models to identify the students predicted to be at the greatest risk of negative impact. The final objective ascertains probable drivers and interventions for the identified at-risk students.

Kelsey Neis

An Analysis of Reader Engagement in Literary Fiction Through Eye Tracking

Advisor: Dongyeop Kang, Department of Computer Science & Engineering

Abstract: Capturing readers' engagement in fiction is a challenging but important aspect of narrative understanding. In this study, we collected 25 readers’ reactions to 2 short stories through eye tracking, sentence-level annotations, and an overall engagement scale survey. Our aim is to analyze the significance of various qualities of the text in predicting how engaging a reader is likely to find it. As enjoyment of fiction is highly contextual, we will also investigate individual differences in our data. Furthering our understanding of what captivates readers in fiction will help better inform models used in creative narrative generation and collaborative writing tools.

Minh Nguyen

Photovoltaic Electricity Generation Forecasting

Advisor: Jie Ding, School of Statistics

Abstract: Accurate solar power generation forecasting is essential for various stakeholders, including electric cooperatives, utility companies, and grid operators, to effectively manage their resources and facilitate the integration of renewable energy. Traditional methods have achieved reasonable success in short-term solar power generation predictions; however, longer-term forecasts remain a significant challenge. This study proposes a new approach utilizing the Teacher Forcing technique on Long Short-Term Memory (LSTM) models to predict solar power generation for the next 24 hours using historical generation data and weather information. The input parameters include measured historical solar radiation, temperature, humidity, air pressure, and active power data. Experimental studies are conducted using a photovoltaic power plant (PVPP) dataset from the Desert Knowledge Australia Solar Centre. The proposed LSTM model with Teacher Forcing is compared with benchmark deep learning methods. Performance metrics, including root mean square error (RMSE), are used to evaluate the accuracy and reliability of the models. The results demonstrate that our LSTM model with Teacher Forcing outperforms conventional forecasting methods, yielding more accurate and reliable long-term solar power generation predictions. The model is also tested on scenarios where weather information is given with certain degrees of accuracy, showing high prediction accuracy. Our findings hold significant implications for the energy sector, facilitating the integration of solar power into the grid and supporting efficient energy portfolio management strategies for various stakeholders.
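The root mean square error (RMSE) named above as an evaluation metric can be sketched as follows. This is a minimal illustration of the standard formula, not the project's evaluation code; the function name and the sample values are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences of numbers."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical hourly generation values (measured) and forecasts.
measured = [0.0, 0.0, 0.0, 0.0]
forecast = [2.0, 2.0, 2.0, 2.0]
error = rmse(measured, forecast)
```

Lower values indicate forecasts that are closer to the measured generation, and the error is expressed in the same units as the data.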

Noah Rissman

Risk Factors for Chronic Absenteeism in the Hopkins School District

Advisor: Erich Kummerfeld, Institute for Health Informatics


Abstract: Chronic absenteeism, defined by the US Department of Education as missing 15 or more days of school in an academic year, is an endemic problem in our nation’s schools. Students who are chronically absent from school are at serious risk of falling behind in their education, and the long-term impact on their development, both educational and social, can be devastating. Unfortunately, once a student has fallen into a pattern of absenteeism, it can be difficult to reverse the trend. Thus, early identification of chronic absenteeism warning signs is crucial if one is to intervene and ensure that a student does not succumb to this dangerous pattern.

Several factors may contribute to chronic absenteeism, such as a student’s health conditions, their academic performance, and a variety of personal circumstances. Moreover, a student’s demographics and socioeconomic status may correlate with their likelihood of missing school. I implement machine learning and other data science techniques to identify the role these variables play in student attendance in Minnesota’s Hopkins School District, with the goal of identifying students at risk of falling into chronic absenteeism in the following year.
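The definition quoted above (15 or more days missed in an academic year) translates directly into a label that a predictive model could be trained against. The sketch below only illustrates that threshold; the student identifiers and absence counts are hypothetical and are not drawn from the Hopkins data.

```python
# Threshold taken from the definition in the abstract: 15 or more days missed.
CHRONIC_ABSENCE_DAYS = 15


def is_chronically_absent(days_missed):
    """Return True if a student's yearly absences meet the 15-day threshold."""
    return days_missed >= CHRONIC_ABSENCE_DAYS


# Hypothetical yearly absence counts keyed by a made-up student identifier.
absences = {"s001": 3, "s002": 15, "s003": 22, "s004": 14}
flagged = sorted(sid for sid, days in absences.items() if is_chronically_absent(days))
```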

Nan Wang

Extracting SBDH concepts from clinical text

Advisor: Rui Zhang, Division of Computational Health Sciences, Medical School

Abstract: My project revolves around using Natural Language Processing (NLP) to extract Social and Behavioral Determinants of Health (SBDH). The goal is to have a model that can scan clinical text and identify all the SBDH concepts it contains. We've separated the project into five stages. First, we identify keywords to look for. Second, we extract all clinical notes that contain those keywords from a dataset. Third, we annotate and refine the keywords until we're satisfied with our sample. Fourth, we train a model using our annotations. Finally, we visualize and report the results.

Destiny Ziebol

Synthetic raw EHR data generation with preserved causal structure

Advisor: Gyorgy Simon, Institute for Health Informatics

Abstract: Electronic Health Record (EHR) data is strongly protected through HIPAA, complicating the collection of large amounts of data needed for ML model development and observational analysis. This project seeks to directly synthesize raw EHR data, including vitals, diagnosis codes, and prescriptions, while preserving internal causal structures. The final result is data that is not HIPAA protected, is untraceable to any individual patient, and accurately portrays a local population while still capturing nuanced disease progression. These data can then be scaled to produce the large amounts of training data needed for modern model development while posing no safety risk to the original patient group(s) used to produce it. This project utilizes diabetes data provided by the Mayo Clinic and Fairview Clinic and takes a causal discovery and Bayesian modeling approach for sample generation.

Session 2 Presenters: 11am - 12pm


Avinash Akella

Beyond Accuracy: Understanding user perception of diversity and serendipity in online movie recommenders

Advisor: Joseph Konstan, Department of Computer Science & Engineering

Abstract: Recommender systems are optimized to recover existing preferences. This work aims to shift the focus away from optimizing existing preferences, and attempts to understand how deeper contexts may result in more useful algorithms. By showing users recommendations from different algorithms, we want to evaluate how the benefits come across to users. We plan on showing users recommendation lists from two algorithms at a time, asking them to do pair-wise evaluations, and seeing which list meets the criteria better. Hopefully, this can inspire future work on designing better experiences for human-in-the-loop systems.

SriHarshitha Anuganti

Causal analysis to investigate the development of ADRD in Bariatric surgery patients

Advisor: Rui Zhang, Division of Computational Health Sciences, Medical School

Abstract: Obesity is associated with multiple comorbidities and is a risk factor for many diseases. Numerous studies have demonstrated an association between obesity and increased cognitive impairments, decreased executive function, and increased rates of dementia, including Alzheimer’s disease. Bariatric surgery is effective in reducing weight and improving comorbid conditions such as diabetes, hypertension, and sleep apnea. However, the impact of bariatric surgery on long-term dementia incidence is unknown. This work investigates the effect of bariatric surgery on Alzheimer's, Dementia, and related diseases (ADRD).

The data considered is from the Acute Care database. Covariate balance checking and propensity score matching were conducted to balance and match similar subjects in the surgery and control groups. A Cox proportional hazards model was chosen to calculate the hazard ratio for the outcome between the surgery and control groups.
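The matching step mentioned above can be illustrated with a minimal sketch of greedy one-to-one nearest-neighbor matching on precomputed propensity scores. The abstract does not say which matching variant was used, so the greedy strategy, the caliper value, and all identifiers and scores below are assumptions made for illustration only.

```python
def match_nearest(treated, control, caliper=0.05):
    """Greedily pair each treated subject with the closest unused control.

    treated, control: dicts mapping a subject id to a propensity score.
    caliper: assumed maximum allowed score difference for a valid match.
    Returns a list of (treated_id, control_id) pairs.
    """
    available = dict(control)  # controls not yet matched
    pairs = []
    # Process treated subjects in order of increasing propensity score.
    for t_id, t_score in sorted(treated.items(), key=lambda item: item[1]):
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # match without replacement
    return pairs


# Hypothetical propensity scores for surgery (treated) and control subjects.
surgery = {"t1": 0.30, "t2": 0.70}
controls = {"c1": 0.32, "c2": 0.69, "c3": 0.10}
matched = match_nearest(surgery, controls)
```

Subjects with no control within the caliper are left unmatched, which trades sample size for closer balance between the two groups.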

Aviral Bhatnagar


Advisor: Jaideep Srivastava, Department of Computer Science & Engineering


Raj Vaibhav Gude

Identifying factors affecting education outcomes in FRPL enrolled K-12 students

Advisor: Erich Kummerfeld, Institute for Health Informatics

Abstract: Although there are glaring ethnic, socioeconomic, and race-based educational disparities in Minnesota, the state's schools do well overall. The government and schools constantly roll out new schemes and changes to narrow the achievement gaps, with the goal of improving every student’s educational and health outcomes. Though the majority of these programs have shown a beneficial impact, there is still considerable scope to further improve the desired outcomes. Free or Reduced Price Lunch (FRPL) is one such program run by the government to provide breakfast and lunch benefits to students from lower-income households. The overall academic performance of students enrolled in the FRPL program (42.3% of students as per the Department of Education 2023 report) is still significantly lower than that of students who are not enrolled in it. This creates the need for a deep dive into the factors contributing to the education disparities in K-12 students with respect to FRPL enrollment.

Identifying and analysing these driving parameters would help in narrowing the gap, as it sheds more light on the amenable factors. In this research I look into various metrics, test hypotheses, and run machine learning and causal models that help to better understand the outcome disparities. The work focuses on how FRPL-enrolled students perform in assessments and tests relative to each other and to non-enrolled students across grades and schools, and further analyses these comparisons across low-mid and high poverty schools. The project aims to identify the determining attributes so as to be able to predict and avert the outcome disparities.

Silas Swarnakanth Kati

Understanding Institutional and Systemic Factors Contributing to Achievement Gaps in Education and Strategies for Mitigation

Advisor: Erich Kummerfeld, Institute for Health Informatics

Abstract: The aim is to explore the institutional and systemic factors contributing to educational achievement gaps in the Hopkins district’s public schools and propose strategies for mitigating these gaps. The study will analyze the complex interplay between various educational systems, institutional structures, and cultural norms that may create or perpetuate disparities in academic outcomes for students from different backgrounds. By understanding the root causes of achievement gaps, educators and policymakers can design more effective interventions promoting equity and closing gaps in educational attainment. The study will draw on a range of empirical evidence from quantitative and qualitative research to provide a comprehensive understanding of the issue and recommendations for action.

Rahul Mehta

Causal Discovery Analysis of Bipolar Disorder Patients

Advisor: Erich Kummerfeld, Institute for Health Informatics

Abstract: This abstract presents a study on causal discovery analysis using the GFCI algorithm in bipolar disorder patients. Bipolar disorder is a complex psychiatric disorder characterized by episodes of mania and depression. Identifying the causal relationships between different variables/symptoms can aid in the understanding of the underlying mechanisms of the disorder and potentially improve treatment outcomes. The GFCI algorithm is a non-parametric approach for causal discovery that can identify both linear and non-linear causal relationships between variables. GFCI is an algorithm that takes as input a dataset of variables and outputs a graphical model called a PAG, which is a representation of a set of causal networks that may include hidden confounders. The PAG that GFCI returns serves as a data-supported hypothesis about causal relationships that exist among the variables in the dataset. Such models are intended to help scientists form hypotheses and guide the design of experiments to investigate these hypotheses. As mentioned, GFCI does not presuppose that there are no hidden confounders. In this study, we apply the GFCI algorithm to a dataset of bipolar disorder patients' manic and depressive symptoms as assessed in the Young Mania Rating Scale and the Montgomery-Åsberg Depression Rating Scale. Our results show that the GFCI algorithm can identify several causal relationships among these variables, providing insights into the underlying mechanisms of the disorder. The identified causal relationships can help clinicians to develop more effective treatment plans tailored to individual patients.

Steven Moore

Stellar Nucleosynthesis

Advisor: Galin Jones, School of Statistics


Gavin Schaeferle

Identifying Unmet Needs for Phenotyping Using Deep NLP Algorithms

Advisor: Moein Enayati, Mayo Clinic

Abstract: Ensuring high-quality healthcare has always been a priority for health systems such as Mayo Clinic, and it requires a timely and efficient method for identifying and providing necessary services to patients. This process is particularly important in the detection and diagnosis of rare genetic diseases, where delayed diagnosis may result in serious health, mental, and financial loss. In recent times, using AI to determine needs for rare services has been key to improving the timely identification of a patient's needs. Because of the nature of rare genetic diseases, an AI method is needed that can handle diseases that are not identifiable through structured data alone, since part of the genetic diagnosis process involves the use of clinical notes to determine a patient's phenotypes. Because of the importance of clinical notes, we propose utilizing recent developments in Natural Language Processing (NLP) to help predict the need for genetic testing. The intended end result is a deployable tool in the Mayo Clinic system for identifying patients who would otherwise not have been seen by the clinical genomics department and for lowering the number of false-positive patients. We will use a multi-modal set of clinical information, including clinical notes and EHR records, in developing a predictive model. We will develop a Deep Neural Network (DNN) employing attention and embedding with Bidirectional Encoder Representations from Transformers (BERT) and compare it to older approaches such as embedding text into DNNs or Long Short-Term Memory (LSTM) networks. Finally, we discuss whether this approach performs better and, more critically, whether the resulting tool could be run in a real-time environment to predict needs for clinical genomics services, or whether similar models perform well enough and are optimized enough for real-time development.

Shifa Siddiqui

Developing a Comprehensive Knowledge Graph of Natural Medicines through Information Extraction with ChatGPT

Advisor: Rui Zhang, Division of Computational Health Sciences, Medical School

Abstract: Therapies that fall under the umbrella of natural medicine are becoming increasingly popular as an alternative to traditional medicine, yet there remains a need for greater understanding of their safety and efficacy. With the growing interest in accessing health information online, it has become increasingly important to provide accurate and reliable information to the public, as the information currently available is of varying quality and scattered across different sources, making it challenging for individuals to find reliable guidance. This project aims to develop a knowledge graph base curated from the Therapeutic Research Center database, a trusted scientific resource. The project utilizes ChatGPT, a language model trained on a massive amount of text, to extract key concepts, entities, and relationships related to natural medicines. The extracted information is then structured into a knowledge graph, which can be used for various applications, such as drug discovery, personalized medicine, and treatment plans. By utilizing this knowledge graph, individuals can easily access high-quality information on the safety and efficacy of therapies, empowering them to make informed decisions about their health.

Xiaobing Wang

A Performance Evaluation of Group-Specific Recommender Systems

Advisor: Xiaotong Shen, School of Statistics

Abstract: A Group-Specific Recommender System is a system that recommends items to a group of users. In this project, I explored several Group-Specific Recommender Systems and evaluated their speed and mean squared error (MSE) on different datasets.