We describe a system for estimating per-pixel heights from a single RGB image, with or without sensor metadata. The system comprises an ensemble of convolutional-deconvolutional neural network (CNN) models and an optimization function; the model chosen for each pixel is validated using high-resolution aerial RGB imagery and lidar data sets. A data knowledge base provides historic, time-stamped, multi-modality data for registration and 3D feature classification. Given a large amount of elevation truth data, a model is trained via CNN image-to-lidar regression to recognize image features of differing heights. When applied to an unseen image, the models estimate a preliminary height per pixel based on the learned feature set. Multiple models are created and trained end-to-end, and the best model and result are determined. We use linear programming optimization, combining an ensemble of regression models with semantic segmentation information from a CNN classification model, to decide optimized per-pixel height estimates. Semantic segmentation data sets help label RGB imagery with feature classes, and CNN classification refines the land-use feature classification to improve accuracy. Each classified land-use feature can be weighted with a confidence metric that helps determine height. We therefore use CNN regression for preliminary height estimation, CNN classification for land-use feature classification, and a per-pixel linear programming reward matrix to automatically decide the optimized height estimate. The rows of the reward matrix contain CNN regression model results from image-to-lidar regression, while the columns contain CNN classification model results from the RGB imagery. An updated volumetric knowledge base stores the system output and is further used for change detection and situational awareness. Both qualitative and quantitative analyses are performed and visualized.
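The convolutional-deconvolutional regression stage can be pictured with a minimal sketch, assuming a PyTorch implementation (the source does not name a framework); the layer widths, depth, and L1 loss below are illustrative placeholders, not the authors' architecture:

```python
# Minimal sketch of a convolutional-deconvolutional network mapping an RGB
# image to a per-pixel height estimate. All hyperparameters are assumptions.
import torch
import torch.nn as nn

class HeightRegressionCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Convolutional encoder: downsample and extract features.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        # Deconvolutional decoder: upsample back to input resolution.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, rgb):                      # rgb: (B, 3, H, W)
        return self.decoder(self.encoder(rgb))   # (B, 1, H, W) height map

# Regression target would be a lidar-derived height raster; L1 loss is one
# common choice for image-to-lidar regression (the loss is an assumption).
model = HeightRegressionCNN()
loss_fn = nn.L1Loss()
```

Each ensemble member would be a variant of this architecture trained end-to-end against the elevation truth data, producing the preliminary per-pixel heights that feed the decision step.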
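The per-pixel reward-matrix decision can likewise be sketched as a small linear program, here using scipy.optimize.linprog; the source states only that rows hold regression-model results and columns hold classification results, so the reward construction, the hypothetical model weights, and the LP-relaxation reading of the selection step are all assumptions:

```python
# Sketch of the per-pixel decision: pick a (regression model, land-use class)
# pair that maximizes reward, via an LP relaxation of the selection problem.
import numpy as np
from scipy.optimize import linprog

def select_height(reg_heights, reward):
    """reg_heights: (m,) heights from the m regression models at this pixel.
    reward:       (m, n) matrix; rows = regression models, cols = classes."""
    m, n = reward.shape
    c = -reward.ravel()                      # linprog minimizes, so negate
    A_eq = np.ones((1, m * n))               # selection weights sum to 1
    res = linprog(c, A_eq=A_eq, b_eq=[1.0], bounds=(0, 1))
    i, j = np.unravel_index(int(np.argmax(res.x)), (m, n))
    return reg_heights[i]                    # height from the winning model

# Illustrative 3-model, 4-class example. Model weights and class confidences
# are hypothetical; the reward definition here is a placeholder.
heights = np.array([2.1, 2.4, 9.8])          # metres, from the ensemble
model_w = np.array([0.5, 0.3, 0.2])          # hypothetical model confidences
class_c = np.array([0.7, 0.1, 0.1, 0.1])     # hypothetical class confidences
R = np.outer(model_w, class_c)               # confidence-weighted reward
print(select_height(heights, R))             # -> 2.1
```

Because the feasible region is a simplex, the LP optimum lies at a vertex, i.e., a single (model, class) pair, which is what lets the relaxation act as a per-pixel selector over the ensemble.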