Abstract
Military training emphasizes experiential learning, such as situational judgements, training simulations, and wargaming scenarios. Because spatial relationships and movement are central to courses of action, maps must be created or found that support specific training objectives (e.g., "a wet gap crossing approaching a large city"). Manually developing specialized maps is time-consuming and often contributes to the long development time of high-quality training scenarios. This work details the creation of TopoGen, a two-part dataset used to train generative models for this purpose. Diffusion models have emerged as a promising method for generating images from text prompts and are trained on (image, description) pairs; however, traditional generative models cannot create high-quality maps for this purpose. We use an off-the-shelf image captioning model, along with maps collected from the internet, to create the 110 (image, description) pairs needed to fine-tune a diffusion model. TopoGen also includes 7,867 (description, generated_image, bounding_box) triplets for visual instruction tuning of multimodal models. Tests indicate that Google's DreamBooth can produce convincing maps from prompts with only 110 labeled examples. The fine-tuned diffusion model can create images from simple prompts describing geographic features, including the size of the map, the relative locations of topography (e.g., hills, mountains), and water features (e.g., rivers, coasts). Once the images are generated, open-vocabulary object detection models locate these features with bounding boxes, informing downstream systems where the key map features are located. TopoGen maps are relevant to a variety of training tasks, including practice items (e.g., route planning) and wargaming scenario documents.
Future work will investigate extensions that generate terrain for interactive simulations, as well as complementary generative AI techniques (e.g., generating scenario text materials) based on the generated map layers.