# kmeans2

## PURPOSE

Fast version of kmeans clustering.

## SYNOPSIS

function [ IDX, C, d ] = kmeans2( X, k, varargin )

## DESCRIPTION

``` Fast version of kmeans clustering.

Cluster the N x p matrix X into k clusters using the kmeans algorithm. It
returns the cluster memberships for each data point in the N x 1 vector
IDX and the K x p matrix of cluster means in C.

This function is in some ways less general than Matlab's kmeans.m (for
example it only uses euclidian distance), but it has some options that
the Matlab version does not (for example, it has a notion of outliers and
min-cluster size).  It is also many times faster than matlab's kmeans.
General kmeans help can be found in help for the matlab implementation of
kmeans. Note that the although the names and conventions for this
algorithm are taken from Matlab's implementation, there are slight
alterations (for example, IDX==-1 is used to indicate outliers).

IDX is a n-by-1 vector used to indicated cluster membership.  Let X be a
set of n points.  Then the ID of X - or IDX is a column vector of length
n, where each element is an integer indicating the cluster membership of
the corresponding element in X.  IDX(i)=c indicates that the ith point in
X belongs to cluster c. Cluster labels range from 1 to k, and thus
k=max(IDX) is typically the number of clusters IDX divides X into. The
cluster label "-1" is reserved for outliers. IDX(i)==-1 indicates that
the given point does not belong to any of the discovered clusters. Note
that matlab's version of kmeans does not have outliers.

USAGE
[ IDX, C, d ] = kmeans2( X, k, [varargin] )

INPUTS
X       - [n x p] matrix of n p-dim vectors.
k       - maximum nuber of clusters (actual number may be smaller)
prm     - additional params (struct or name/value pairs)
.k         - [] alternate way of specifying k (if not given above)
.nTrial    - [1] number random restarts
.maxIter   - [100] max number of iterations
.display   - [0] Whether or not to display algorithm status
.rndSeed   - [] random seed for kmeans; useful for replicability
.outFrac   - [0] max frac points that can be treated as outliers
.minCl     - [1] min cluster size (smaller clusters get eliminated)
.metric    - [] metric for pdist2
.C0        - [] initial cluster centers for first trial

OUTPUTS
IDX    - [n x 1] cluster membership (see above)
C      - [k x p] matrix of centroid locations C(j,:) = mean(X(IDX==j,:))
d      - [1 x k] d(j) is sum of distances from X(IDX==j,:) to C(j,:)
sum(d) is a typical measure of the quality of a clustering

EXAMPLE