Abstract
In only a handful of years, large language models (LLMs) have become an integral part of modern society. These models are used daily for a variety of tasks, from generating cooking recipes to conducting PhD-level research. With the advent of open-source, pre-trained models, it is increasingly compelling to integrate this technology into our systems. However, the reliability of these models is questionable due to LLMs' tendency to hallucinate or censor information when confronted with data outside their training sets or in conflict with their internal biases. While refinement and assessment of these models on ground truth examples can mitigate these issues, additional challenges arise when the ground truth is unknown. Without proper tools to evaluate LLM trustworthiness under uncertainty, the risks posed by integrating LLMs into Department of Defense (DoD) systems are unacceptable given the lethality, scale, expense, and criticality of those systems. The risks are further complicated because many off-the-shelf LLMs have inherent knowledge only of publicly available information, so prompting for controlled information often results in hallucinations.
In our I/ITSEC 2024 presentation, “Mapping Trust in AI: Right Tool, Right Task,” we proposed a novel trustworthiness metric and presented a methodology to analytically compute and visualize the degree of trust placed across an array of AI model predictions. In continuation of this work, we have extended our research to evaluate the trustworthiness of LLMs. In particular, this study focuses on the evaluation of these models in ground truth-agnostic environments. By utilizing an ensemble of local LLMs, we create a trustless system, inspired by blockchain consensus mechanisms, capable of evaluating response trustworthiness under uncertainty and returning the most trustworthy response. Our experiments demonstrate that the proposed system can distinguish trustworthy from untrustworthy behavior, mitigating risk and increasing LLM adoption.
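The consensus step described above can be illustrated with a minimal sketch. This is not the paper's actual method: it assumes the ensemble responses have already been collected, and it uses simple token-overlap (Jaccard) similarity as a stand-in agreement measure; the function names and scoring are illustrative only.

```python
# Illustrative sketch: pick the ensemble response that agrees most with
# its peers, a simple proxy for consensus-based trustworthiness when no
# ground truth is available. Similarity metric and names are assumptions.

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two responses (0.0 to 1.0)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def most_trustworthy(responses: list[str]) -> tuple[str, float]:
    """Return the response with the highest mean agreement with the
    rest of the ensemble, along with that agreement score."""
    best, best_score = responses[0], -1.0
    for i, resp in enumerate(responses):
        peers = [jaccard(resp, o) for j, o in enumerate(responses) if j != i]
        score = sum(peers) / len(peers)
        if score > best_score:
            best, best_score = resp, score
    return best, best_score

# Toy ensemble: two models agree, one hallucinates.
responses = [
    "The capital of Australia is Canberra.",
    "The capital of Australia is Canberra.",
    "The capital of Australia is Sydney.",
]
answer, agreement = most_trustworthy(responses)
```

In this toy run the two agreeing responses reinforce each other, so the dissenting answer is outvoted without any reference to a ground-truth label, which is the essence of the trustless, consensus-style evaluation described above.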