Device Technologies and Biomedical Robotics
An Evaluation of Large Language Model Misinformation Responses
Jordan Rodriguez
Student
University of Arizona
Tucson, Arizona, United States
Katelyn Rohrer
Student
University of Arizona
Tucson, Arizona, United States
Camila Grubb
Student
University of Arizona
Tucson, Arizona, United States
Zachary Hansen
Student
University of Arizona
Tucson, Arizona, United States
Marvin J. Slepian
Regents Professor
University of Arizona
Tucson, Arizona, United States
Artificial intelligence offers great opportunity to provide additional insight and information on a given matter. In health and the biomedical space in particular, there is great interest in the utility of AI systems for providing new information of clinical benefit. New on the scene are Large Language Model (LLM) generative AI systems (e.g., ChatGPT, Bing). While great excitement surrounds these systems, recent concerns have been raised regarding the possibility of internal hallucinations and the generation of misinformation, if not disinformation. In this study, we examined the ability of an AI model to distinguish between fact and fiction across a variety of categories. Our hypothesis is that ChatGPT is vulnerable to affirming misinformation provided by users, particularly in domains for which it has little training data.
For the purposes of this experiment, we used the ChatGPT API. Adversarial input testing was used to evaluate the vulnerability of the AI system to misinformation by subjecting it to specially crafted inputs designed to provoke unexpected or incorrect responses. We generated a series of ground-truth and adversarial (plausible but incorrect) inputs across several problem domains and subjectively assigned obscurity levels. The testing set consisted of 100 statements in the domains of medicine, mathematics/programming, and general trivia. Each statement fed to the model was given a subjective obscurity ranking of “High”, “Medium”, or “Low”, as well as a truth value of “True” or “False”. Most ground-truth statements were paired with a corresponding adversarial input. Analysis was then conducted on the model’s ability to accurately distinguish truth from falsehood. Responses were classified with a sentiment-analysis model to determine whether the model affirmed or rejected the offered claim, and were verified manually in cases where the response was ambiguous.
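As an illustration of this workflow, the sketch below shows a minimal version of the adversarial input loop in Python. The example statements, the gpt-3.5-turbo model name, and the use of a default Hugging Face sentiment pipeline as the affirm/reject classifier are assumptions for illustration only, not the exact configuration used in this study.

```python
# Minimal sketch of the adversarial input pipeline (illustrative only).
# The statements, model name, and sentiment classifier are assumptions,
# not the study's exact configuration.
from openai import OpenAI
from transformers import pipeline

client = OpenAI()                           # reads OPENAI_API_KEY from the environment
sentiment = pipeline("sentiment-analysis")  # stand-in affirm/reject classifier

# Each test item: (statement, ground-truth value, subjective obscurity level)
statements = [
    ("Aspirin irreversibly inhibits the COX-1 enzyme.", True, "Low"),   # ground truth
    ("Aspirin is a selective COX-2 inhibitor.", False, "Low"),          # adversarial pair
]

results = []
for text, truth, obscurity in statements:
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": f"True or false: {text}"}],
    ).choices[0].message.content

    # Treat a positive sentiment toward the claim as an affirmation;
    # ambiguous replies were checked manually in the study.
    affirmed = sentiment(reply)[0]["label"] == "POSITIVE"
    results.append({"statement": text, "truth": truth,
                    "obscurity": obscurity, "affirmed": affirmed})
```

Comparing the affirmed flags against the ground-truth labels yields the confusion-matrix counts used for the F1 analysis below.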
Results: The comparison between the expected truth values and the model’s evaluated responses was used to generate the F1 score, a metric commonly used in binary classification tasks: F1 = TP / (TP + 0.5(FP + FN)), where TP is true positives (the model affirmed true information), TN is true negatives (the model rejected misinformation), FP is false positives (the model affirmed misinformation), and FN is false negatives (the model rejected true information). The overall F1 score for the dataset was 0.866. The model accurately affirmed true information ~91% of the time and accurately rejected misinformation ~83% of the time. An F1 score was also calculated independently for each obscurity ranking: 0.91 for low, 0.87 for medium, and 0.81 for high. In addition, the F1 score for each category was calculated: 0.83 for trivia, 0.85 for mathematical/programming, and 0.91 for medical.
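For reference, the F1 computation itself is a one-liner; the counts below are placeholders chosen only to exercise the formula and are not the study’s actual confusion matrix.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = TP / (TP + 0.5 * (FP + FN)); true negatives do not enter the score."""
    return tp / (tp + 0.5 * (fp + fn))

# Placeholder counts for illustration only.
print(round(f1_score(tp=40, fp=6, fn=4), 3))  # -> 0.889
```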
Conclusions: From these data we conclude that although the model can generally distinguish between fact and falsehood in a variety of domains, the risk that it will accept and affirm misinformation increases as the information becomes more obscure. The model also performed worse in the “Trivia” and “Mathematical/Programming” categories than in the “Medical” category.
Discussion: These data demonstrate the tendency of LLM AI to occasionally affirm or generate factually incorrect information, particularly when prompted by the user with a false statement. The ~83% accuracy in rejecting false information, compared with ~91% in affirming true information, suggests that AI models have a greater tendency to agree with information fed to them, particularly in cases where there is less relevant material in their training data, as with the “High” obscurity prompts. As AI systems are increasingly deployed in different contexts, it is crucial to understand this tendency and to carefully evaluate information output by AI language models for accuracy and safety.
This project was funded and supported by NIH R25 DK128859