Many popular clustering algorithms, such as k-means and spectral clustering, require a priori knowledge of the number of clusters. This forces the analyst either to know properties of the data in advance (which is often not the case) or to estimate those properties, which becomes intractable for large, unstructured, high-dimensional data. Nearly all popular clustering algorithms also rely on a distance metric; while not inherently problematic, distance metrics introduce hyperparameters such as distance thresholds that again demand a priori knowledge of the data.
This paper introduces Correlated Histogram Clustering (CHC), which requires no a priori knowledge of the number of clusters and assumes nothing about the magnitude of values in any dimension. Designed to handle large, unstructured, high-dimensional, and noisy data, CHC leverages probabilistic techniques to build density estimates rather than relying on distance metrics. CHC applies the lowland modality algorithm to determine the modes of each dimension and then correlates those modes with points in the original dataset to form a cluster centroid. This centroid may then be used for training, substantially reducing the amount of data needed for supervised learning.
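The pipeline described above can be sketched in a few lines of NumPy. This is only an illustrative approximation: the per-dimension mode detector below is a simple histogram local-maximum test standing in for the lowland modality algorithm, and the bin count, minimum-mass fraction, and one-bin-width correlation tolerance are all assumed parameters, not values from the paper.

```python
import numpy as np

def dimension_modes(values, bins=16, min_frac=0.05):
    """Estimate the modes of one dimension from its histogram.

    Stand-in for the lowland modality algorithm (assumption): a bin
    counts as a mode if it is a local maximum of the histogram and
    holds at least `min_frac` of the points.
    """
    counts, edges = np.histogram(values, bins=bins)
    modes = []
    for i, c in enumerate(counts):
        left = counts[i - 1] if i > 0 else 0
        right = counts[i + 1] if i < len(counts) - 1 else 0
        if c >= left and c > right and c >= min_frac * len(values):
            modes.append(0.5 * (edges[i] + edges[i + 1]))  # bin centre
    return np.array(modes)

def chc_centroid(data, bins=16):
    """Correlate per-dimension modes back to points and average them.

    Points whose value in every dimension falls within one bin width
    of some mode of that dimension are kept; their mean is returned
    as the cluster centroid, along with how many points were kept.
    """
    n, d = data.shape
    mask = np.ones(n, dtype=bool)
    for j in range(d):
        col = data[:, j]
        width = (col.max() - col.min()) / bins
        modes = dimension_modes(col, bins=bins)
        if modes.size == 0:
            continue  # no confident mode in this dimension
        near = np.min(np.abs(col[:, None] - modes[None, :]), axis=1) <= width
        mask &= near
    return data[mask].mean(axis=0), int(mask.sum())
```

On synthetic data with one dense cluster plus uniform noise, `chc_centroid` recovers a centroid near the cluster mean while discarding most noise points, illustrating how mode correlation filters the significant few points from the insignificant many.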
Separating the significant few data values from the insignificant many in training data with CHC can transform training syllabi by identifying which syllabus elements correlate with real skill attainment and which do not accelerate it. Additional benefits of applying CHC to large, unstructured, high-dimensional, noisy data include dimensionality reduction and insight into the modal nature of the data. In one supervised learning classification application, twenty-seven features were reduced to only three, an 89% reduction in dimensionality.