Bioinformatics, Computational and Systems Biology
Comparing Predictive Machine Learning Models’ Capacity to Objectively Predict Dyspnea
Hannah M. Liu (she/her/hers)
Student
University of Pennsylvania
Lincolnshire, Illinois, United States
Hersh Sagreiya
Assistant Professor of Radiology
Hospital of the University of Pennsylvania, United States
Dyspnea is a symptom characterized by shortness of breath and is a major predictor of mortality for patients suffering from serious respiratory illnesses. It must be treated quickly, as failure to address it may lead to respiratory failure or death. Dyspnea is currently assessed via subjective numerical rating scales on which patients rate their perceived level of breathing effort. However, because dyspnea is severe and often signals a more serious underlying respiratory condition, the ability to automatically and accurately estimate a patient’s breathing exertion would greatly help doctors identify dyspnea and effectively treat potential respiratory risks. Machine learning makes it possible to monitor dyspnea levels in real time and provide doctors with immediate feedback.
We obtained clinical data from a prospective cohort study in which we collected cerebral hemodynamic changes and vital signs from COPD patients while they performed treadmill walking tests. These measurements served as our objective dyspnea scores and were captured via software that utilizes specialized cerebral sensors, signal processing, and predictive model algorithms. Additionally, we had the patients report their breathing exertion levels on the Borg Rating of Perceived Exertion Scale (values range from 6 to 20) to obtain subjective dyspnea scores. Using these data, we developed and trained twenty-two machine learning models to predict objective dyspnea scores from cerebral hemodynamic measurements and vital signs. We then assessed each model’s accuracy by comparing its predictions to the patients’ reported subjective dyspnea scores.
We utilized Decision Tree Regressors, Random Forest Regressors, XGBoost Regressors, CatBoost Regressors, and LightGBM Regressors in our study. When setting our prediction target, we chose to have the models predict the maximum Borg score reported by the patient within each one-minute interval, since these scores more accurately reflected a patient’s real-time dyspnea level. Pre-processing the data included creating two feature sets, one for cerebral hemodynamic changes and one for vital signs, and handling missing values within the data. This allowed us to determine which set of measurements better predicted dyspnea and ensured the data was well prepared to serve as training and validation data for the machine learning models.
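A minimal sketch of this pre-processing step is shown below, assuming the raw measurements arrive in a pandas DataFrame; the column names (elapsed_seconds, borg_score, patient_id, and the feature channels) are hypothetical placeholders rather than the study’s actual schema:

```python
import pandas as pd

# Hypothetical channel names for the two feature sets (assumptions,
# not the study's actual measurement names).
CH_FEATURES = ["hbo2", "hhb", "tsi"]               # cerebral hemodynamics
VS_FEATURES = ["heart_rate", "spo2", "resp_rate"]  # vital signs

def build_dataset(df: pd.DataFrame, features: list[str]) -> pd.DataFrame:
    """Aggregate raw samples into one row per patient-minute.

    The target is the maximum Borg score reported within each
    one-minute interval; feature values are averaged over the interval.
    """
    df = df.copy()
    df["minute"] = (df["elapsed_seconds"] // 60).astype(int)
    grouped = df.groupby(["patient_id", "minute"]).agg(
        {**{f: "mean" for f in features}, "borg_score": "max"}
    )
    # Simple missing-value handling: forward-fill within each patient,
    # then drop any rows that are still incomplete.
    grouped[features] = grouped.groupby(level="patient_id")[features].ffill()
    return grouped.dropna().reset_index()
```

Applied once per feature set, e.g. build_dataset(raw_df, CH_FEATURES), this yields one modeling table per measurement type, so the two sets can be compared directly.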
For each type of algorithm, we implemented 5-fold cross-validation (k=5) via a pipeline as well as train-test split validation; the only exception was the LightGBM algorithm, which was validated with a train-test split only. Including both validation methods allowed us to assess how well a model would perform both within and beyond the training data. Additionally, for the Decision Tree Regressors, we created models configured with either a tuned, optimal maximum number of leaf nodes or the library’s default setting, which allowed us to assess whether tuning this hyperparameter improved accuracy.
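The sketch below illustrates this validation scheme with scikit-learn on placeholder data; the models shown and the tuned leaf count of 50 are illustrative assumptions, not the exact parameters used in our experiments:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor

# Placeholder data standing in for one feature set and its max-Borg targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.integers(6, 21, size=200)  # Borg scale runs from 6 to 20

models = {
    # max_leaf_nodes=None is the library default; the tuned value
    # (50 here, purely illustrative) is the alternative configuration.
    "tree_default": DecisionTreeRegressor(random_state=0),
    "tree_tuned": DecisionTreeRegressor(max_leaf_nodes=50, random_state=0),
    "random_forest": RandomForestRegressor(random_state=0),
}

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

for name, model in models.items():
    pipe = make_pipeline(SimpleImputer(strategy="mean"), model)
    # 5-fold cross-validation (scikit-learn reports negated MAE).
    cv_mae = -cross_val_score(
        pipe, X_train, y_train,
        cv=KFold(n_splits=5), scoring="neg_mean_absolute_error",
    ).mean()
    # Train-test split validation on the held-out portion.
    split_mae = mean_absolute_error(
        y_test, pipe.fit(X_train, y_train).predict(X_test)
    )
    print(f"{name}: CV MAE={cv_mae:.3f}, split MAE={split_mae:.3f}")
```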
To measure the accuracy of each model, we calculated the mean absolute error (MAE) between each model’s predictions and the patients’ subjective dyspnea scores.
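As a brief worked illustration of the metric (with made-up values, not study results):

```python
from sklearn.metrics import mean_absolute_error

# Hypothetical Borg scores: patient-reported vs. model-predicted.
reported = [7, 9, 11, 13, 15]
predicted = [8, 9, 12, 12, 16]
# MAE = mean(|reported - predicted|) = (1 + 0 + 1 + 1 + 1) / 5 = 0.8
print(mean_absolute_error(reported, predicted))  # 0.8
```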
For both the cerebral hemodynamic (CH) and vital sign (VS) feature sets, the CatBoost and Random Forest regression algorithms performed the best out of the twenty-two models created. CatBoost yielded the lowest train-test split MAE (1.087 for CH, 1.143 for VS) and the second-lowest cross-validation MAE (2.551 for CH, 2.432 for VS) within both feature sets. Random Forest yielded the second-lowest train-test split MAE (1.144 for CH, 1.202 for VS) and the lowest cross-validation MAE (2.369 for CH, 2.345 for VS) within both feature sets.
Additionally, the cross-validation MAE scores for both feature sets are substantially higher than the train-test split MAE scores, indicating that the models overfit to some degree and that their performance on unseen data can still be improved. Reporting the cross-validation MAE scores alongside the train-test split scores gives us a realistic impression of these models’ performance and a clear idea of our next steps.
Although more work can be done to improve the creation and testing of these models, the performance of the machine learning models we created establishes a strong foundation for further predictive modeling work on objectively predicting dyspnea in patients experiencing respiratory illnesses. Given that the subjective Borg scale ranges from 6 to 20, mean absolute error scores between 1 and 3 are a promising starting point for this research field. Our machine learning models can be further improved with techniques such as tuning model hyperparameters, applying dimensionality reduction to the input variables, using correlation coefficients to assess model performance, and implementing leave-one-out cross-validation. Developing machine learning models that can quickly predict a patient’s dyspnea level with little error will allow doctors to treat serious respiratory illnesses with minimal turnaround time, and our research serves as a critical step forward in this domain.
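As one illustration of these next steps, the sketch below shows leave-one-out cross-validation with scikit-learn on placeholder data; in practice the scheme would more likely hold out one patient at a time rather than one sample:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Placeholder data standing in for one feature set and its Borg targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = rng.integers(6, 21, size=60)

# Leave-one-out CV: each sample is held out exactly once, so with
# n samples the model is trained n times on the remaining n-1 samples.
loo_mae = -cross_val_score(
    RandomForestRegressor(random_state=0),
    X, y, cv=LeaveOneOut(), scoring="neg_mean_absolute_error",
).mean()
print(f"LOOCV MAE: {loo_mae:.3f}")
```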