Abstract
The exact amount the Department of Defense (DoD) spends on software sustainment each year is not known, but a conservative estimate based on past budgets puts the figure close to $10 billion. Writing proper integration tests can significantly reduce these sustainment costs by reducing the number of software bugs while increasing software reliability and usability. However, writing proper integration tests is time consuming and tedious. Testing libraries already make writing and running integration tests easy, but a human must still manually comb through and understand the codebase well enough to write the test logic.
Recent advancements in the AI/ML space show that large language models (LLMs) have the potential to write high-quality test code, but the literature reports relatively few successful real-world efforts to automate the writing of integration tests with LLMs.
Our study addresses a critical gap by applying an LLM-RAG pipeline specifically to integration test scenarios within an application-specific, GUI-based Python testing framework. Our small-scale real-world evaluation demonstrates a sixfold reduction in user task completion time for generating Python integration test code compared to traditional manual methods, with 96% of the generated code snippets containing zero errors. We also show how the size of the knowledge base relative to the codebase affects the quality of the downstream code generation. This work highlights practical considerations for integrating LLM-RAG solutions into existing software development workflows, addressing common challenges such as building an effective knowledge base. Ultimately, this research offers valuable guidance to the community, illustrating how to leverage advanced AI technology to accelerate software delivery, enhance reliability, and enable engineering teams to refocus on innovation and strategic tasks.