Clustering of image data is the process of sorting images into groups that exhibit similarities. During clustering, images are reduced to feature vectors, and the statistics of those features are used to place the images into statistically similar groups. Systems that collect images during their operation (e.g. autonomous ground vehicles and satellites) create large datasets that ideally should be sorted with minimal human effort. K-means is one of the most widely used methods for automatically sorting images. However, it is heavily influenced by its initialization, most importantly by the need to know the number of clusters a priori. To overcome this shortcoming, validity indices have long been used to find the optimal number of clusters into which the data should be separated. The work presented in this paper comprises an Extension to the Variance Ratio Criterion (E-VRC) that, when combined with K-means, can cluster image data of high content variance without requiring any input such as the expected number of clusters; it therefore operates in an unsupervised manner. Comparisons with other available unsupervised methods (i.e. X-means, U-K-Means, and attractive-repulsive clustering) are discussed in order to demonstrate the superior performance of the new E-VRC method. Several image datasets are used in the comparative studies. The robustness of the E-VRC method is also demonstrated by processing datasets with imbalanced contents (i.e. many more images from certain clusters than from the remaining clusters) and by processing mixed datasets (i.e. composed of very diverse types of images). It is demonstrated that the E-VRC does not depend on initializations and is insensitive to both data dimensionality and content randomness, and it is therefore an effective tool for efficiently estimating the number of clusters and performing the clustering of image data.
Keywords
MACHINE LEARNING
Additional Keywords
unsupervised learning, image clustering
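
The validity-index idea mentioned in the abstract can be illustrated with a minimal sketch. It uses the classic Variance Ratio Criterion (the Calinski-Harabasz index), not the E-VRC extension, whose details are not given here. The index is the between-cluster sum of squares divided by (k - 1), over the within-cluster sum of squares divided by (n - k); K-means is run for each candidate k and the k with the highest index is kept. All function names, the deterministic farthest-first initialization, and the synthetic 2-D points standing in for image feature vectors are assumptions made for illustration only.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def farthest_first(points, k):
    """Deterministic initialization (an assumption of this sketch):
    start at points[0], then repeatedly add the point farthest from
    the centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, iters=100):
    """Plain Lloyd iterations; returns one cluster label per point."""
    centers = farthest_first(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old center if a cluster happens to become empty.
            new_centers.append(mean(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def variance_ratio(points, labels, k):
    """Classic VRC: (between / (k - 1)) / (within / (n - k))."""
    n = len(points)
    overall = mean(points)
    between = within = 0.0
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            continue
        center = mean(members)
        between += len(members) * sq_dist(center, overall)
        within += sum(sq_dist(p, center) for p in members)
    if within == 0.0:
        return float("inf")
    return (between / (k - 1)) / (within / (n - k))


def estimate_k(points, k_max=6):
    """Return the k in [2, k_max] that maximizes the classic VRC."""
    scores = {k: variance_ratio(points, kmeans(points, k), k) for k in range(2, k_max + 1)}
    return max(scores, key=scores.get)


def make_blobs(seed=42, per_blob=30):
    """Synthetic stand-in for image features: three separated 2-D blobs."""
    rng = random.Random(seed)
    blob_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
    return [[cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5)]
            for cx, cy in blob_centers for _ in range(per_blob)]


if __name__ == "__main__":
    print("estimated number of clusters:", estimate_k(make_blobs()))
```

The sketch sweeps every candidate k and needs an upper bound `k_max`, which is exactly the kind of user input an unsupervised method aims to avoid; it is meant only to show how a validity index turns K-means into a cluster-count estimator.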