Bioinformatics, Computational and Systems Biology
Big Data and Machine Learning Approach to Predict Diabetes Risk for Early Intervention
Stephanie Wang (she/her/hers)
UCRiverside Research Intern
Stanford University Online High School
Fullerton, California, United States
Boyue Wang
Data Scientist Intern
University of California, Riverside, United States
Jiayu Liao (he/him/his)
Professor
UCR
Riverside, California, United States
Diabetes is a chronic disease that affects a significant proportion of the global population. According to the Centers for Disease Control and Prevention (CDC) diabetes report, an estimated 37.7 million people in the US have diabetes, corresponding to 11.3% of the population. The disease can affect several organs and tissues, causing numerous acute and chronic complications, and can even lead to death following COVID-19 infection. Identifying and intervening at the pre-diabetes stage can significantly decrease the risk of developing the disease. However, diagnosing and managing diabetes remain challenging because testing is incomplete in most clinics. Machine learning algorithms such as XGBoost show potential for distinguishing diabetes from non-diabetes patients through big data analysis. This study aims to evaluate the effectiveness of an XGBoost model in classifying individuals as diabetes or non-diabetes using multi-omics data, without clinical testing.
The multi-omics dataset was obtained from the West China Hospital (Chengdu, Sichuan, China) and comprises clinical data for 140 individuals, including both diabetes and non-diabetes patients. The study was approved by the IRB committee of the West China Hospital. The data were generated by mass spectrometry for proteomics, lipidomics, metabolomics, and GC-MS, yielding 480 proteomics features, 735 lipidomics features, 1644 metabolomics features, and 43 GC-MS features, for 2902 features in total. The data were preprocessed to remove missing or inconsistent values, and the dataset was then split 70:30 into training and testing sets. To enhance XGBoost performance, feature selection methods such as recursive feature elimination (RFE) and correlation-based feature selection (CFS) were used to identify key variables for diabetes classification. The model was evaluated on the 30% testing set using accuracy, precision, recall, F1-score, and area under the ROC curve. A confusion matrix was generated to visualize the classification results.
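The evaluation metrics named above all derive from the four cells of a binary confusion matrix. The following is a minimal sketch in plain Python, not the study's own code: the function name `binary_metrics` and the toy labels are hypothetical, with 1 standing for diabetes and 0 for non-diabetes.

```python
def binary_metrics(y_true, y_pred):
    """Return confusion-matrix counts and derived metrics for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # diabetes, predicted diabetes
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-diabetes, predicted non-diabetes
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-diabetes, predicted diabetes
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # diabetes, predicted non-diabetes

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": accuracy, "precision": precision,
        "recall": recall, "f1": f1,
    }


# Toy labels for illustration only; these are not data from the study.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In a screening setting, recall (the share of diabetes patients the model catches) and precision (the share of positive predictions that are correct) are worth reading alongside accuracy, since accuracy alone can hide an imbalance between the two error types.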
The XGBoost model demonstrated promising performance in distinguishing between diabetes and non-diabetes patients. The model achieved an accuracy of 81.25% and an area under the Receiver Operating Characteristic (ROC) curve (AUC) of 93.73%, indicating its effectiveness in classifying patients based on their clinical data. The feature selection process identified 32 features as most relevant for predicting patient outcomes, reducing the complexity of the model without compromising its performance. These results suggest the potential of the XGBoost model as a valuable tool for early prediction and diagnosis of diabetes, paving the way for timely interventions and improved patient care. Moreover, the most significant features identified could provide insights into the underlying biological processes and contribute to a better understanding of the disease's pathophysiology.
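As a reading aid for the AUC figure, the AUC equals the probability that a randomly chosen diabetes patient receives a higher model score than a randomly chosen non-diabetes patient. The sketch below computes it directly from that definition in plain Python. It illustrates the metric in general and is not how the study computed its value; the function name `roc_auc` and the toy scores are hypothetical.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy scores for illustration only; these are not data from the study.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(y_true, scores)
print(round(auc, 4))
```

Because AUC depends only on how the scores rank patients, it can be high even when accuracy at a fixed decision threshold is noticeably lower.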
This study demonstrated the effectiveness of a multi-omics-based XGBoost model in predicting diabetes risk without clinical exams. Its high accuracy and AUC values support the model's potential as a reliable approach for early prediction and diagnosis of diabetes, facilitating prompt interventions and improved patient outcomes.