This post discusses the implementations of K-Means in Ganitha, a scalding library recently open-sourced by Tresata. We’ll discuss the K-Means algorithm itself, and the efforts made by us to allow the algorithm to scale well on Hadoop with the number of clusters, k, using the K-Means|| implementation of Bahmani et al.
K-means clustering consists of partitioning data points into k ‘clusters’ where each point belongs to the cluster with the nearest mean. These points can exist in a Euclidean space, for example, or in a general vector space that might have hundreds of dimensions that could correspond to the frequency of words in a document. K-Means is a specific form of clustering analysis, where the goal is to partition a set of observations into a number of different groups that consist of points that are considered to be ‘close’ to one another. K-Means works by starting with an initial seed of cluster centers, and iteratively refining the centers of these groups in order to minimize the mean squared error between data points and their cluster centers.
The process of refining the centers of the clusters is commonly known as Lloyd’s algorithm, however there exist heuristic algorithms to seed the initial selection of cluster centers in order to improve the rate of convergence of Lloyd’s algorithm. K-Means++ offers an improvement over random initial selection, and more recently, K-Means|| offers an initialization technique that greatly cuts down on the number of iterations needed to determine initial clusters, a very desirable optimization in Hadoop applications, where significant overhead is involved in each iteration. K-means is by its very nature an iterative algorithm, and as such, there is a non-negligible amount of overhead from setting up each step in Map-Reduce. Though k-means|| is a great step towards cutting down on the number of steps, a tool like Spark which can cache datasets in memory across iterations, features a promising alternative to the Map-Reduce paradigm, and we have been exploring the use of Spark in implementing iterative algorithms at Tresata.
Ganitha provides an extensible interface for handling vector operations using different representations for data points, including Mahout vectors (which can contain categorical and textual features in addition to numerical values). The
VectorHelper trait can be used to specify how vectors are defined from the input and how distances are calculated between vectors. For example, the
MahoutHelper takes advantage of a useful sparse representation for vectors in Mahout which can include traits that are not only numerical, but categorical or textual in nature, which can be useful for performing cluster analyses on corpora that contain information for which it is not immediately obvious how to place data points inside a real vector space (a favorite example of ours at Tresata is classifying songs from a database containing album and artist metadata, genres, beats per minute, and various other traits). A good distance function to use in this case would be the Cosine difference, in lieu of a more conventional Euclidean norm. K-means in Ganitha (currently) reads vectors from Cascading Sequence files, and the algorithm writes a list of vectorid-clusterid pairs to a Tap, as well as a list of cluster ids with coordinates.
K-means isn’t without its limitations, however: one disadvantage to k-means is that you must specify the number of clusters as an input, and choosing a non-ideal value for k can result in substantially worse clusterings. K-means also works best for detecting clusters that are characterized by a Gaussian distribution of points; running k-means on a density-based clustering, for example, will yield undesirable results. Care should be taken to first decide if k-means is the best algorithm for the problem at hand, as using it makes some implicit assumptions about the distributions of data points. It’s probably a good idea to make sure you understand the data first before you decide if k-means is the best choice for cluster analysis on your data.
Ganitha (which can be found on Github here uses sbt for generating builds. To create a runnable jar distribution, download the source and run
sbt update and
sbt assembly within the ganitha project folder. Unit tests are included and can be run using
To run K-means clustering on a test set of data, stored as a comma-separated values file with a header (in this example, with a file on Hadoop named 100kPoints.csv with the header (
id,x,y), run the following command from within the ganitha directory:
hadoop jar ganitha-ml/target/scala-2.9.2/ganitha-ml-assembly-0.1-SNAPSHOT.jar com.twitter.scalding.Tool com.tresata.ganitha.ml.clustering.KMeansJob --hdfs --vecType StrDblMapVector --distFn euclidean --k 100 --id id --features x y --input 100kPoints.csv --vectors 100kVectors --vectorOutput vectorAssignments --clusterOutput centroids
This will use the
id columns as the vector id, and will encode the coordinates(
Map[String, Double] vectors (using the
StrDblMapVector VectorHelper), under a Euclidean space, and run the algorithm on k=100 clusters. The output is written to a
vectorAssignments file on Hadoop, with the cluster centroids written to
vectors argument specifies a location for the Cascading Sequence file that serves as the input for
Lloyd, S., “Least squares quantization in PCM”. IEEE Trans. Information Theory, 28(2):129-137, 1982.
Arthur, D. and Vassilvitskii, S. (2007). “k-means++: the advantages of careful seeding”.
Proc. ACM-SIAM Symp. Discrete Algorithms. pp. 1027–1035.
Bahmani, B. et al. (2012). “Scalable k-means++”. Proceedings of the VLDB Endowment, 5(7), 622-633.