Over the past decade, we have seen moderate demand for simulation-based training systems to include automatic speech recognition (ASR). Like commercially available services such as Apple's Siri and Google Now, ASR gives training systems the capability to interpret human speech and to react to that speech with appropriate actions (e.g., executing a spoken command) and responses (e.g., replying to a human with confirmation or a request for clarification). This capability is intended to address instructor-manning limitations and to improve the fidelity of the training experience. Historically, however, ASR successes within simulation-based training systems have been modest. We contend that this lack of widespread usage and success stems primarily from a fundamental misunderstanding of (and thus lack of investment in) the components necessary to achieve more effective ASR. In this paper, we describe the essential functions of ASR: (1) Recognition translates the audio of a spoken utterance into text. (2) Understanding attempts to glean meaning from that text, determining whether it denotes, for example, a new directive, a response to a previous query, or a request for new information. (3) Behavior refers to the functions the system performs after receiving a recognized speech utterance. (4) Some training systems also employ dialogue when continuous interaction with humans is required. Finally, we outline current ASR research and development, discuss typical implementations, and introduce potential strategies to improve specific ASR functions, and the capability as a whole, to provide better support for future training systems.
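The Recognition, Understanding, and Behavior functions enumerated above can be illustrated as a minimal pipeline sketch. All function names and the keyword-based intent matching below are illustrative assumptions for exposition, not part of any real ASR toolkit; a deployed system would replace the stubbed recognizer with an acoustic and language model.

```python
# Toy sketch of the pipeline: Recognition -> Understanding -> Behavior.

def recognize(audio: bytes) -> str:
    """Recognition: translate a spoken utterance into text (stubbed here)."""
    # Assumption: a real recognizer would decode the audio; we return fixed text.
    return "move to waypoint alpha"

def understand(text: str) -> dict:
    """Understanding: glean an intent from the text via a toy keyword match."""
    if text.startswith("move to"):
        return {"intent": "directive", "target": text.removeprefix("move to ").strip()}
    if text.endswith("?"):
        return {"intent": "information_request"}
    return {"intent": "unknown"}

def behave(intent: dict) -> str:
    """Behavior: act on the interpreted utterance and confirm (or ask for clarification)."""
    if intent["intent"] == "directive":
        return f"Acknowledged: moving to {intent['target']}."
    return "Say again?"

reply = behave(understand(recognize(b"\x00fake-audio")))
```

A dialogue component, where present, would wrap this loop so that the system's response (e.g., "Say again?") feeds back into further turns with the trainee.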