Leveraging Machine Learning to Predict Inherited Variants Associated with Chronic Lymphocytic Leukemia

Student

Raphael Mwangi

Advisor

Cavan Reilly

Abstract

This project applies machine learning approaches to genome-wide association studies (GWAS) to identify novel susceptibility variants. It integrates machine learning algorithms to predict the risk of chronic lymphocytic leukemia (CLL) by modelling single nucleotide polymorphisms (SNPs). A two-step approach combines feature selection algorithms with classification algorithms to quantify CLL risk in the cohort. The first step obtains a group of interacting SNPs with high predictive potential for CLL risk using three ensemble methods: random forest (RF), extreme gradient boosting (XGBoost), and light gradient boosting machine (LightGBM). The second step applies a support vector machine (SVM) and regularized logistic regression (RLR) to the selected SNPs to classify CLL cases and controls. In the feature selection step, LightGBM outperformed both RF and XGBoost, and its feature importance scores indicated that it was robust in selecting SNPs with high predictive potential for CLL risk. Algorithm comparisons show that integrating LightGBM with an SVM using a radial basis function (RBF) kernel yields an area under the ROC curve (AUC) of 63.4%, compared with 60.4% for the baseline SVM, while integrating LightGBM with RLR using an elastic-net penalty yields an AUC of 63.6%, compared with 59.4% for the baseline RLR. The current analytical paradigm for GWAS is to evaluate each SNP individually for association with disease risk. Although this method has successfully identified susceptibility SNPs across cancers, it is limited because it does not use the many possible multi-way SNP-SNP interactions. This is the first study that attempts to predict CLL risk by integrating machine learning algorithms, an approach that augments traditional methods.
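
The two-step design described in the abstract can be illustrated with a minimal sketch. It is not the project's actual code: it assumes scikit-learn, uses a synthetic genotype matrix rather than CLL data, and uses a random forest as the feature selector in place of LightGBM so that the example needs only one library. The helper `two_step`, the number of retained SNPs (`n_keep`), and all hyperparameters are hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for a genotype matrix: rows are subjects, columns are
# SNPs coded as 0, 1, or 2 copies of the minor allele. Not real CLL data.
n_subjects, n_snps = 600, 200
X = rng.integers(0, 3, size=(n_subjects, n_snps)).astype(float)

# Hypothetical risk model: two additive SNPs plus one SNP-SNP interaction.
logit = 0.8 * X[:, 0] + 0.8 * X[:, 1] + 0.6 * X[:, 2] * X[:, 3] - 2.2
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)


def two_step(classifier, n_keep=20):
    """Step 1: tree-ensemble feature selection. Step 2: classification."""
    selector = SelectFromModel(
        RandomForestClassifier(n_estimators=200, random_state=0),
        max_features=n_keep,
        threshold=-np.inf,  # rank by importance and keep the top n_keep SNPs
    )
    return make_pipeline(selector, StandardScaler(), classifier)


classifiers = {
    "SVM (RBF kernel)": SVC(kernel="rbf", C=1.0, gamma="scale"),
    "RLR (elastic net)": LogisticRegression(
        penalty="elasticnet", solver="saga", l1_ratio=0.5, C=1.0, max_iter=5000
    ),
}

results = {}
for name, clf in classifiers.items():
    # Baseline: the classifier on all SNPs, with no feature selection.
    baseline = make_pipeline(StandardScaler(), clone(clf))
    # Two-step: the same classifier on the selected SNPs only.
    selected = two_step(clone(clf))
    for label, model in (("baseline", baseline), ("two-step", selected)):
        model.fit(X_train, y_train)
        scores = model.decision_function(X_test)
        results[(name, label)] = roc_auc_score(y_test, scores)

for (name, label), auc in results.items():
    print(f"{name:18s} {label:9s} AUC = {auc:.3f}")
```

Because the selector sits inside the pipeline, it is fitted on the training split only, so the held-out AUC is not inflated by selecting SNPs with knowledge of the test labels. The AUC values printed here reflect the synthetic data and say nothing about the results reported in the abstract.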
