Speaker: Dr. Giuseppe Di Fatta
Title: Efficiency and scalability of the k-Means clustering method
Abstract: One among the most influential and popular data mining methods is the k-Means algorithm for cluster analysis. Clustering is a classical unsupervised machine learning problem of the identification of groups of similar objects within a set, which is particularly useful in typical exploratory data mining applications. The scalability of the algorithms in terms of the input objects, the number of features and the number of clusters is crucial for very large sets of data. Techniques for improving the efficiency of k-Means have been largely explored in two main directions, sequential optimisation techniques and parallel processing. In the first case, the amount of computation in the sequential algorithm can be significantly reduced by adopting geometrical constraints, efficient data structures (e.g. multi-dimensional binary search trees) and partitioning strategies. This talk provides insights on the efficiency and scalability of the k-Means clustering method for both sequential and parallel implementations. In particular, it presents some recent work which improves the performance of the algorithm for very large data sets (millions of patterns) with high numbers of features and clusters.