A central challenge in the intelligence community is managing and effectively integrating large numbers of disparate information sources so that knowledge can be presented concisely to analysts. Currently, the high volume of incoming intelligence places a substantial burden on the analyst, who must make sense of inconsistent, noisy data and may therefore miss intelligence about entities and their relationships.
To address this need, we have developed a data-driven approach that unifies disparate mentions of individuals and relationships to provide the analyst with an overview of the social network hidden in large, noisy databases. Our approach automatically discovers systems of related concepts in the data's structure and learns ontologies suited to representing the knowledge encapsulated in the database. Taking advantage of recent advances in nonparametric Bayesian clustering (Kemp et al., 2006), the system analyzes streams of data to disambiguate references to the same entity and to identify groups of semantically related entities. The tool thus fuses knowledge across the datastore to create concise profiles of entities for use in analysis and an improved ontology for use in semantic search engines.
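As a rough illustration of what "nonparametric" means here, the sketch below samples a partition from a Chinese restaurant process, the prior over cluster assignments used in the infinite relational model of Kemp et al. (2006). It shows only the prior, not the inference procedure of the system described above; the function name `sample_crp` and the concentration parameter `alpha` are assumptions made for this example.

```python
import random


def sample_crp(n_items, alpha, seed=0):
    """Sample cluster assignments for n_items from a Chinese restaurant process.

    Hypothetical helper for illustration. Item i joins an existing cluster
    with probability proportional to that cluster's size, or opens a new
    cluster with probability proportional to alpha (alpha > 0).
    """
    rng = random.Random(seed)
    assignments = []  # cluster index chosen for each item, in order
    sizes = []        # current number of items in each cluster
    for _ in range(n_items):
        # Existing clusters are weighted by size; the last slot is "new cluster".
        weights = sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)
        else:
            sizes[k] += 1
        assignments.append(k)
    return assignments
```

Because the number of clusters is not fixed in advance, it can grow as more mentions arrive, which is the property that makes this family of models attractive for streams of entity references.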
We evaluated our approach on operational sensor data collected during the JFCOM-sponsored Empire Challenge 2010 (EC10) military training exercise. The EC10 dataset mirrors operational tactical intelligence datasets and is characterized by a high level of sparsity and noise (missing and incomplete data, inconsistent manual coding). In preliminary experiments, our system produced high-precision semantic clusters of entities by resolving disparate references to entities and uncovering hidden relationships. On the task of resolving entity references, our approach yielded a 38% improvement in purity and a 6% improvement in F-measure over a baseline k-means clustering algorithm. These results indicate that our approach is better able to "connect the dots" across disparate documents to produce consolidated entity profiles than widely used clustering methods.
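For readers unfamiliar with the two metrics, the following minimal sketch shows how cluster purity and an F-measure can be computed from predicted cluster labels and gold entity labels. The text does not say which F-measure variant was used; the pairwise definition below is one common choice and is an assumption, as are the function names.

```python
from collections import Counter
from itertools import combinations


def purity(pred, gold):
    """Fraction of items that belong to the majority gold class of their cluster."""
    clusters = {}
    for p, g in zip(pred, gold):
        clusters.setdefault(p, Counter())[g] += 1
    return sum(max(c.values()) for c in clusters.values()) / len(pred)


def pairwise_f_measure(pred, gold):
    """F1 over item pairs: a pair is positive when both items share a cluster.

    Assumed variant; the evaluation described above may define F-measure differently.
    """
    tp = fp = fn = 0
    for i, j in combinations(range(len(pred)), 2):
        same_pred = pred[i] == pred[j]
        same_gold = gold[i] == gold[j]
        if same_pred and same_gold:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_gold:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Purity rewards clusters dominated by a single entity but does not penalize splitting one entity across many clusters, which is why it is typically reported alongside a measure such as F-measure that balances precision and recall.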