The rapid deployment of Large Language Models (LLMs) like OpenAI's ChatGPT across various sectors has underscored the urgent need for robust verification and validation techniques. These models exhibit broad command of both linguistic nuance and the many tasks represented in their training data. This versatility makes LLMs desirable, but their questionable reliability combined with their widespread use creates a critical need to verify and validate them. This paper focuses on the verification and validation of models that generate source code to accomplish a task specified in a user-provided prompt.
A key challenge with LLMs is verifying and validating their output. Modern LLMs rely on deep neural networks to transform an input prompt into an output. Explaining the output of a sufficiently complex neural network is nearly impossible given its deep architecture, billions of parameters, and non-linear operations across high-dimensional spaces. This lack of explainability demands rigorous verification and validation methods to establish confidence in the models.
This paper presents the results of evaluating the code generation capabilities of previous-generation and current state-of-the-art LLMs and compares them against human programmers of varying skill levels. Both the models and the human programmers are evaluated by a set of hidden unit tests, face validation by a group of subject matter experts, and an adversarial machine learning model trained to recognize machine-generated programs. Each of these methods is used to gauge the current code generation capabilities of LLMs and to demonstrate its use in performing verification and validation of LLMs in general.
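The hidden unit-test evaluation described above can be sketched as follows. This is an illustrative example only, not the paper's actual harness: the candidate program, function name, and test cases are hypothetical, and a real harness would sandbox execution rather than use exec directly.

```python
# Sketch of scoring a model-generated program against hidden unit tests.
# Hypothetical example; the task and tests are not the paper's benchmark data.

def evaluate_candidate(source: str, func_name: str, hidden_tests) -> float:
    """Execute a candidate program and score it against hidden unit tests.

    Returns the fraction of hidden tests passed (0.0 if the code fails to run
    or does not define the expected function).
    """
    namespace = {}
    try:
        exec(source, namespace)          # run the generated program
        func = namespace[func_name]      # look up the required entry point
    except Exception:
        return 0.0                       # non-running code scores zero
    passed = 0
    for args, expected in hidden_tests:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass                         # a crashing test counts as a failure
    return passed / len(hidden_tests)

# Example: grade a model-generated solution for a simple addition prompt
candidate = "def add(a, b):\n    return a + b\n"
tests = [((1, 2), 3), ((-1, 1), 0), ((0, 0), 0)]
print(evaluate_candidate(candidate, "add", tests))  # 1.0 — all tests pass
```

Because the tests are withheld from the model at generation time, a high pass rate reflects genuine task completion rather than memorization of the test cases themselves.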
Keywords
MACHINE LEARNING; STATISTICS; VERIFICATION, VALIDATION AND ACCREDITATION (VV&A)
Additional Keywords
Large Language Models, Generative AI