Abstract
Recent advances in generative AI have positioned Large Language Models (LLMs) as potent tools for a variety of tasks, yet their widespread adoption raises important questions about reliability, quality, and trustworthiness. Building on previous work presented at I/ITSEC in 2024, which introduced verification and validation methods for code-generating LLMs, this paper significantly expands both the dataset and the evaluation techniques. The limited sample of the previous study has grown to over 10,000 code samples, comprising human-written code from public GitHub repositories and outputs from three distinct LLMs solving a suite of interview challenges.
This expanded dataset enables a more rigorous quantitative assessment of these models. Established code quality metrics, such as cyclomatic complexity, Halstead complexity, and maintainability index, are used to compare and contrast human-written code with its LLM-generated counterparts. Additionally, each program’s abstract syntax tree was analyzed to uncover structural patterns that uniquely characterize LLM-generated code. Multiple machine learning models were also trained to distinguish LLM-generated code from human-written code, with a primary focus on explainability. By highlighting the most salient features guiding classification decisions, these models provide deeper insight into the elements that differentiate LLM-generated programs from those authored by humans.
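The kind of per-sample feature extraction described above can be sketched with Python's standard `ast` module. This is a simplified McCabe-style approximation and a plain node-type histogram; the paper's actual tooling and feature set are not specified here, so the function names and the set of decision-point nodes below are illustrative assumptions:

```python
import ast
from collections import Counter

# Node types counted as branch points for a simplified McCabe
# cyclomatic complexity (1 + number of decision points). Production
# tools handle more constructs; this set is an illustrative assumption.
_DECISION_NODES = (ast.If, ast.For, ast.While, ast.And, ast.Or,
                   ast.ExceptHandler, ast.IfExp, ast.comprehension)

def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe cyclomatic complexity of a code sample."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, _DECISION_NODES)
                   for node in ast.walk(tree))

def node_type_counts(source: str) -> Counter:
    """Histogram of AST node types, usable as structural features
    for a classifier distinguishing human- from LLM-written code."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))

sample = """
def classify(x):
    if x > 0:
        return "positive"
    return "non-positive"
"""
print(cyclomatic_complexity(sample))       # 2: the single `if` adds one branch
print(node_type_counts(sample)["Return"])  # 2 return statements
```

Vectors of such per-program measurements, one row per code sample, are the typical input to the explainable classifiers the abstract mentions.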
This paper presents the results of this analysis and illustrates how LLM-generated code differs from human-written code, both in structural complexity and in measurable software engineering attributes. In addition to offering deeper insight into the reliability and clarity of LLM-generated code, these findings have broader implications for automated code review, compliance audits, and best practices in AI-assisted programming workflows. By systematically identifying and quantifying the strengths and weaknesses of LLM-generated code, this work enhances the understanding of how to effectively integrate AI into modern software development while preserving critical standards of quality, security, and maintainability.