Spring 2024 Data Science Poster Fair - Poster Details

Every year, data science M.S. students present their capstone projects during this event as a part of their degree requirements. 

The poster fair is open to the public, and all interested undergraduate and graduate students, alumni, staff, faculty, and industry professionals are encouraged to attend.

Session 1 Presenters: 10am - 11am


Jashwin Acharya

Use of a large language model for few-shot learning to predict dementia

Advisor: Wei Pan, School of Public Health

Abstract: This study explores the use of TabLLM, an open-source framework for applying Large Language Models (LLMs) to tabular data, which converts a tabular dementia diagnosis dataset into prompts by means of specific prompt templates. Once the prompts are generated, the open-source T-Few code can be used to fine-tune a T0-3B model quickly on them in a few-shot setting. My analysis revealed that TabLLM achieves test AUC comparable to baseline logistic regression, decision tree, and Transformer-based models in 128-shot, 256-shot, and 512-shot settings, and remains competitive with the UK Biobank Dementia Risk Score (UKBDRS). My study highlights the potential of LLMs for dementia diagnosis and proposes future research directions using much larger LLMs that could perform better on this task.
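
To illustrate the idea of turning a table row into a prompt, here is a minimal sketch of a sentence-per-feature text template. It is not the presenter's code: the feature names, values, and question wording are hypothetical and invented for illustration only.

```python
# Hypothetical sketch: serialize one tabular record into a text prompt.
# Feature names, values, and the question are invented for illustration.


def serialize_row(row: dict[str, object]) -> str:
    """Turn one table row into a string with one short sentence per feature."""
    return " ".join(f"The {name} is {value}." for name, value in row.items())


def build_prompt(row: dict[str, object], question: str) -> str:
    """Append the classification question to the serialized row."""
    return f"{serialize_row(row)}\n\n{question}\nAnswer:"


example_row = {"age": 71, "education level": "college", "memory test score": 24}
prompt = build_prompt(example_row, "Does this person have dementia? Yes or no?")
print(prompt)
```

In a few-shot setting, a small number of such prompts paired with their known labels would serve as the fine-tuning examples.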

Aviral Bhatnagar

Genome Sequencing

Advisor: Jaideep Srivastava, Department of Computer Science and Engineering

Abstract: TBD

Jiahao He

Identifying Health Condition Factors that Impact K-12 Education Outcomes

Advisor: Erich Kummerfeld, Institute for Health Informatics

Abstract: As education has developed, people have come to emphasize the importance of students’ health. Certain medical conditions might negatively affect students’ achievement, in education or otherwise. It is therefore crucial to identify the impact of health conditions on K-12 education outcomes.

In this paper, I analyze several medical conditions, including ADD/ADHD and autism spectrum disorder, and their correlation with students’ education outcomes. The main measure of these outcomes is GPA. This study aims to understand the connection between these medical conditions and student outcomes, focusing on students in the Hopkins School District.

Jooyong Lee

Exploring Health-Related Determinants of Student's Academic Performance: A Causal Inference Approach Using the DoWhy Python Library

Advisor: Erich Kummerfeld, Institute for Health Informatics

Abstract: Investigating the determinants of social disparities is significant in comprehending societal dynamics and contributing substantially to resolving diverse issues within a community. In education, extensive research has explored the causal effects of students' backgrounds, encompassing factors such as family size and income, on academic disparities. 

For a more comprehensive understanding of the causal effects on students' academic performance at school, research should consider not only students' social background but also other student factors, such as health status. Therefore, this project focuses on describing the causal effects of health factors on educational disparities among students.

This project applies the DoWhy Python library, which aims to spark causal thinking and analysis, to student data that the Hopkins Public School District has shared with UMN for causal inference experiments. Using DoWhy, the project estimates the causal effect of a selected feature on students' scores with the backdoor, frontdoor, and instrumental variable methods, because these methods can estimate a causal effect by blocking confounders and avoiding spurious correlations. Moreover, results can vary depending on how the causal graphical model is defined; therefore, this project experiments with various causal graphical models constructed in two steps: Step 1. Discovering candidate causal graphs consistent with the dataset. Step 2. Inspecting, editing, and modifying the graph manually to match world knowledge.
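
As a rough illustration of what backdoor adjustment does, the sketch below estimates an average treatment effect by stratifying on a single observed confounder. It uses plain Python rather than DoWhy, and the records (a confounder stratum, a binary health-factor flag, and a score) are hypothetical toy values, not data from the project.

```python
from collections import defaultdict

# Hypothetical sketch of backdoor adjustment by stratification.
# Each record is (z, t, y): confounder stratum, binary treatment, outcome.


def backdoor_ate(records):
    """Estimate E[Y | do(T=1)] - E[Y | do(T=0)], adjusting for confounder Z."""
    outcomes = defaultdict(list)  # (z, t) -> observed outcomes
    z_counts = defaultdict(int)   # z -> number of records in that stratum
    for z, t, y in records:
        outcomes[(z, t)].append(y)
        z_counts[z] += 1

    n = len(records)
    ate = 0.0
    for z, count in z_counts.items():
        treated, control = outcomes[(z, 1)], outcomes[(z, 0)]
        if not treated or not control:
            raise ValueError(f"stratum z={z!r} lacks treated or control units")
        # Within-stratum difference in means, weighted by P(Z = z).
        diff = sum(treated) / len(treated) - sum(control) / len(control)
        ate += diff * count / n
    return ate


def naive_difference(records):
    """Unadjusted difference in mean outcome between treated and control."""
    treated = [y for _, t, y in records if t == 1]
    control = [y for _, t, y in records if t == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


# Toy data, invented for illustration only.
toy = [(0, 0, 70), (0, 0, 72), (0, 1, 75), (1, 0, 80), (1, 1, 85), (1, 1, 87)]
print(backdoor_ate(toy), naive_difference(toy))
```

In this toy data the unadjusted difference is larger than the adjusted estimate, because the treated group is concentrated in the stratum with higher scores. The adjustment is only valid if the chosen variables actually block every backdoor path in the assumed causal graph, which is why the graph definition matters.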

Hahnemann Ortiz

Convergence of AI and DLT

Advisor: Daniel Boley, Department of Computer Science and Engineering

Abstract: First described by Nathan Rosenberg in the 1960s, convergence describes a phenomenon in which two or more initially separate items move toward unity and become increasingly integrated. An example is Bitcoin, created using techniques from various computer science domains such as distributed systems, cryptography, security, and game theory. I predict that Artificial Intelligence (AI) and Distributed Ledger Technology (DLT) will eventually converge. In this work, I explore that possibility by building on the efforts of Satoshi Nakamoto’s implementation of DLT (Bitcoin), Princeton’s BlockSci blockchain analysis platform, and Judea Pearl’s do-calculus framework for insights into complex causal relationships within the entire history of Bitcoin data.

Jong Inn Park

Graphical Text Summarization Using Generative AI

Advisor: Dongyeop Kang, Department of Computer Science and Engineering

Abstract: This work proposes an innovative end-to-end approach to summarize and visualize transcribed text data from speeches, such as meeting notes, which are often unstructured and multidimensional. Leveraging advancements in Automatic Speech Recognition (ASR) and Generative AI, this work aims to transform long, text-based summaries into structured, graphical visualizations, thus enhancing accessibility and comprehension. Traditional text summaries, while organized, fail to offer an immediate understanding of the key points and topic structure of speeches. Our method employs ASR technology, notably OpenAI's Whisper, to transcribe spoken content into text, which is then processed using various summarization modes customized to the content's nature—such as Q&A, timelines, and topic clustering. These summaries are enriched with additional information and structured to highlight significant content, intending to facilitate a deeper and quicker comprehension through graphical representation. This approach aims to bridge the gap in current speech summarization tools by providing a visual summary that can significantly improve user engagement and understanding, especially in contexts like meetings or Q&A sessions where multiple topics and speakers are involved.

Hari Veeramallu

Study the feasibility of generating a top-down view of an Underwater Robot given an input stream from n RGB camera sensors

Advisor: Junaed Sattar, Department of Computer Science and Engineering

Abstract: Autonomous robots are being used in increasingly diverse environments for a range of use cases, facilitated partly by the rapid advances in AI technologies, reducing human effort and risk to human lives. Recently, Autonomous Underwater Vehicles (AUVs) have seen numerous deployments for various tasks ranging from underwater debris cleanup to marine biology research. A key difficulty in remote navigation and trajectory planning for AUVs is that the underwater environment poses unique challenges to visibility and station keeping, making AUV deployment particularly difficult. To overcome these challenges, a top-down view (Bird’s Eye View) representation of the robot can be generated using the RGB camera inputs from a stereo camera setup on the AUV to provide a clear notion of the surrounding environment. This paper proposes an end-to-end architecture based on Pyramid Stereo Matching Network (PSMNet) and Lift-Splat-Shoot (LSS) to provide a Bird’s Eye View (BEV) representation using inputs from an RGB stereo camera. This work closely relates to the LSS architecture and can be modified in the future to handle n-camera inputs to generate a more precise and accurate BEV representation.

Tianhong Zhang

Comparative Analysis of Deep Learning and Stacking Methods for Link Prediction in Network Data

Advisor: Tianxi Li, School of Statistics

Abstract: In this capstone project, I will explore how deep learning and stacking methods perform in predicting links within various types of networks. 

Session 2 Presenters: 11am - 12pm


Venkata Sai Krishna Abbaraju

Reviving lost data: Applying ML to impute missing data in factory datasets

Advisor: Jaideep Srivastava, Department of Computer Science and Engineering

Abstract: Large volumes of data are produced during manufacturing operations, and these data are essential for assuring product quality and maximizing production efficiency. However, incomplete or missing values in industrial datasets can compromise the precision and dependability of data processing and decision-making. Using existing imputation techniques that are customized to the unique features of manufacturing processes, this work tackles the issue of missing value imputation in manufacturing data. Our objective is to optimize the handling of missing values in manufacturing datasets by applying machine learning, statistical approaches, and domain expertise. This will lead to better data quality and increased predictive modeling performance. The results of this study have important ramifications for streamlining production procedures and guaranteeing product quality.

Dinesh Reddy Challa

Influence of Snowfall on the Fuel Consumption of Winter Maintenance Vehicles

Advisor: William Northrop, Department of Mechanical Engineering

Abstract: Winter maintenance vehicle fuel consumption increases with snowfall due to changes in road conditions and driving behavior. Quantifying fuel use is important for estimating costs and for understanding the impact of snow-clearing operations on the environment, thus enabling transportation departments to focus on areas that can contribute towards their sustainability goals.

Calculating fuel economy is challenging in snowplows because recorded onboard diagnostics (OBD) data often do not include mass, which can fluctuate significantly when applying de-icing substances to the road. This report outlines a novel method to isolate fuel usage associated with snowfall, accounting for gross vehicle weight, using OBD data. For days with snowfall totaling 2 inches or more, fuel use rose about 16.5% to 22.9% as compared to days without snowfall. Fully loaded trucks were found to use 13.2% to 18.5% more fuel than half-loaded trucks. The results could be used in practice by motivating better fleet management strategies, for example by minimizing mass, or to model the feasibility of snowplow electrification.

Amrutha Shetty Jayaram Shetty

Bridging AI Dimensions: Small Model Precision Meets Large Model Depth in Therapy

Advisor: Dongyeop Kang, Department of Computer Science and Engineering

Abstract: The goal of the project is to compare the performance of a small-scale model trained on carefully curated patient simulation data against a larger, more general model. Our technical efforts include the careful curation and collection of high-quality simulation data derived from seed data comprising real-time dialogues between counselors and patients. The system works on a fine-tuned taxonomy, delineating four primary stress types (physical, psychological, psychosocial, and psychospiritual) along with their corresponding subcategories, while employing skillful triggering prompts to enhance both self-awareness and patient engagement. Performance will be evaluated using custom metrics that are tailored to our specific objectives, allowing for a comprehensive comparison of the two models. Innovative strategies are implemented to improve the overall efficiency of the system.

Rahul Mehta

Transdiagnostic causal models of relationships among manic and depressive symptoms in mania, depression, mixed state, and euthymia

Advisor: Erich Kummerfeld, Institute for Health Informatics

Abstract: This study explores the causal relationships among manic and depressive symptoms in different mood states of bipolar disorder (BD), including mania, depression, mixed state, and euthymia. Traditional causal inference methods are limited in addressing this complex question, necessitating novel approaches. Leveraging network theory and computational causal modeling, specifically causal discovery modeling (CDM), we aim to elucidate the intricate interactions among symptoms and their potential causal influences. Our analysis incorporates data from 10 NIMH-funded studies, comprising nearly 6000 BD patients. We investigate variations in the causal structure across mood states, identify symptoms with the greatest causal influence, and examine the impact on critical outcomes such as suicidality and aggression. The study hypothesizes differences in network connectivity between mood states, emphasizing the centrality of mood and energy symptoms in shaping the causal landscape. Findings hold implications for refining therapeutic strategies, informing diagnostic criteria, and enhancing our understanding of the pathophysiology of BD.

Sam Penders

LIGO All-Sky Long-Duration Transient Search Using Deep Learning

Advisor: Vuk Mandic, School of Physics and Astronomy

Abstract: Gravitational waves--ripples in spacetime caused by accelerating astrophysical objects--were first measured directly by the LIGO Scientific Collaboration in 2015 from the in-spiral and merger of a binary black hole system. Now, the search for transient, long-duration gravitational waves with less well-defined waveforms is one of the most important research areas in gravitational wave astronomy. In this work, I train deep neural networks, including Res-UNet and Res-UNet++, to recover the waveforms of long-duration gravitational wave events in spectrograms of simulated, noisy strain data from the LIGO gravitational wave detectors. These networks may be implemented in the LIGO gravitational wave search pipeline to detect new events.

Eric Trempe

Predicting Patient Cancer Types Through Medical Measures

Advisor: Tianxi Li, School of Statistics

Abstract: Different cancer types require different treatment types in order to avoid relapses. If the wrong treatment is given to a cancer patient, they are more likely to have a relapse and continue their struggle with cancer. Early diagnosis of which cancer type a patient has can allow them to get the proper treatment and therefore make them less likely to have a relapse. In this project we use 25 observations of medical data from 4 different stages of patient checkups to predict whether or not the patient will have a cancer relapse. Since there are hundreds of variables but few observations and a substantial amount of missing data, we impute missing values using Principal Component Analysis. This method proved more effective than imputing using time-based means and patient-based means in this data. We develop statistical and machine learning models to predict whether or not a patient would relapse, including logistic regression, random forests, boosted trees, and more. We use “leave one out validation” and compare the model results to determine the best model. Finding an effective model would allow medical care teams to identify patient cancer types at earlier stages and therefore reduce patient relapses due to incorrect treatments.
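
For readers unfamiliar with leave-one-out validation, the sketch below shows the general procedure: each observation is held out once, the model is fit on the remaining observations, and the held-out prediction is scored. The nearest-centroid classifier and the six toy observations are hypothetical stand-ins, not the models or data used in the project.

```python
# Hypothetical sketch of leave-one-out validation with a stand-in classifier.


def leave_one_out_accuracy(features, labels, fit, predict):
    """Hold out each observation once, train on the rest, score the held-out one."""
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        correct += predict(model, features[i]) == labels[i]
    return correct / len(features)


def fit_centroids(xs, ys):
    """Stand-in model: the per-class mean of a single numeric feature."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict_centroid(centroids, x):
    """Predict the class whose mean is closest to x."""
    return min(centroids, key=lambda y: abs(centroids[y] - x))


# Toy data, invented for illustration only.
toy_features = [1.0, 1.2, 0.9, 3.0, 3.1, 2.9]
toy_labels = [0, 0, 0, 1, 1, 1]
accuracy = leave_one_out_accuracy(toy_features, toy_labels, fit_centroids, predict_centroid)
print(accuracy)
```

With only a few dozen observations, this scheme uses nearly all of the data for training in every round, which is one reason it is a common choice for very small datasets.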

Keith Willard

Using BART generative synthetic data to improve BERT parsing of patient prescription instructions

Advisor: Xiaotong Shen, School of Statistics

Abstract: Ambiguity of clinician-authored patient medication instructions in electronic prescriptions is an ongoing source of clinician/pharmacist miscommunication, which has patient safety implications. The National Council for Prescription Drug Programs has established a structured sig standard that allows a combination of discrete category and numerical value fields to represent the human-authored sentence, but it is provided in less than 25% of electronic prescriptions. A BERT model was trained to predict the key values of the structured sig from the human-authored sentence using a 25 million row dataset derived from electronic prescription transactions. In 7 of 10 categories the model achieved >99% accuracy on a separate test set--a substantial improvement over traditional NLP techniques. To improve accuracy on lower performing categories, training data was augmented with generatively produced sigs using a BART seq2seq model trained to produce a synthetic sig sentence from the structured sig. The model for the worst performing category was trained against a dataset augmented using these synthetic sentences. The model trained on the BART-augmented data achieved an accuracy of 97.15% when evaluated on the test set, an improvement over both the baseline model at 95.55% (all models trained for 50 epochs) and a control model trained on a dataset augmented by simple duplication of the discrepant data, which improved to 96.60%. This demonstrates the potential role of appropriately synthetically augmented training data for improving language model performance in this domain. As all models were still (slightly) improving at 50 epochs, they did not achieve the highest performance possible, but practical limits on training costs prevented further investigation.

Linjun Xia

A Correlation and Causality Study of Student Behavioral Conditions with Health and Achievement

Advisor: Erich Kummerfeld, Institute for Health Informatics

Abstract: The objective of this investigation was to examine the relationship and potential causality between student academic performance and mental health within the Hopkins School District. Utilizing a cross-temporal panel dataset, alongside assessments of psychological well-being and academic performance records, this research offers a comprehensive evaluation of a demographically diverse student body in the Hopkins School District. Initially, the study quantified correlations between academic achievements and indicators of mental health using Spearman rank correlation coefficients. Mental health metrics were derived both from directly provided data and indirect indicators based on student behaviors, including chronic absenteeism and illness. Subsequently, the investigation delved into the causal impact of mental health status on academic performance, employing an instrumental variables strategy alongside a two-way fixed effects model. This approach meticulously accounted for potential confounding variables such as socioeconomic status, educational environment, and ethnic background, inferred from student behavior and demographic characteristics. The findings reveal a significant association between mental health and academic performance, demonstrating that mental health status exerts a considerable adverse effect on educational outcomes, even when adjusting for a variety of confounders. This research underscores the critical need for enhancing student mental health as a strategy to bolster academic achievement and furnishes an empirical foundation for educational institutions and policymakers to craft effective mental health interventions.
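
As a brief illustration of the Spearman rank correlation mentioned above, the sketch below ranks both variables (averaging tied ranks) and then computes the Pearson correlation of the ranks. The absence counts and GPA values are hypothetical and invented for illustration only.

```python
# Hypothetical sketch: Spearman rank correlation in plain Python.


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Toy data, invented for illustration only.
days_absent = [2, 5, 9, 14, 20]
gpa = [3.9, 3.6, 3.7, 2.8, 2.5]
print(spearman(days_absent, gpa))
```

Because it depends only on ranks, this measure captures monotonic relationships without assuming they are linear, which suits indicators such as absenteeism counts.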

SriHarshitha Anuganti

Development of dementia in patients who underwent bariatric surgery

Advisor: Rui Zhang, Department of Surgery

Abstract: Obesity is associated with multiple comorbidities and is a risk factor for many diseases. Numerous studies have demonstrated an association between obesity and increased cognitive impairments, decreased executive function, and increased rates of dementia, including Alzheimer’s disease. Bariatric surgery is effective in reducing weight and improving comorbid conditions such as diabetes, hypertension, and sleep apnea. However, the impact of bariatric surgery on long-term dementia incidence is unknown. This work investigates the effect of bariatric surgery on Alzheimer's, dementia, and related diseases (ADRD).

The data cover patients who have undergone bariatric surgery, drawn from the Acute Care database. Covariate balance checking and propensity score matching were conducted to balance and match similar subjects in the surgery and control groups. A Cox proportional hazards model was chosen to calculate the hazard ratio for the outcome between the surgery and control groups.