Friday, February 1, 2013

Spectral clustering with eigengap heuristic: A MATLAB implementation

In this post I will present a step-by-step tutorial on a basic spectral clustering algorithm and a simple  implementation in MATLAB.

The problem

The goal is given a set of $\rho$-dimensional data (say $n$ in total), to group the instances in such a way that points belonging in the same cluster are more similar to each other than to data points in other clusters.  There are many algorithms that can be used to complete this task, but here we will specifically focus on the spectral clustering approach.   The latter has become one of the most popular modern clustering approaches and has been shown to outperform more traditional clustering algorithms that are based on generative models (e.g., EM).  Such algorithms involve strong assumptions (e.g., that the data points in a cluster follow a multivariate Gaussian distribution) and the log likelihood can have many local minima, resulting in a bad clustering.  Intuitively, since as we will see spectral clustering works on the eigenvectors of a similarity matrix/network between the original data points, it can capture inherent patterns in the data that might not be possible otherwise.  In what follows we will present the steps followed by spectral clustering one-by-one.

Spectral clustering in a nutshell

Create a similarity graph between the data instances: Given our $n$ data points, say $\overrightarrow{x_1},~\overrightarrow{x_2},...,\overrightarrow{x_n} \in \Re^{\rho}$, we construct  a graph $G=(V,E)$.  Every vertex $v_i\in G$, is essentially one data point (e.g., $\overrightarrow{x_i}$).  In order to define the set of edges $E$ of this similarity graph, we essentially need to define a notion of similarity $s_{ij}$ between the vertices (i.e., data points) $v_i$ and $v_j$ (i.e., $\overrightarrow{x_i}$, $\overrightarrow{x_j}$).  Then the edges of the graph can be constructed in a variety of ways.  Here we review the so called, $m$-nearest neighbor similarity graph.  The goal is to connect $v_i$, with $v_j$ if the latter is among the $m$-nearest neighbors of $v_i$.  However, note that this can create directed graphs, since $v_i$ and $v_j$ might not be simultaneously to each others $m$-nearest neighbors.  For avoiding this problem, we connect $v_i$ and $v_j$ with an edge, if $v_i$ is among the $m$-nearest neighbors of $v_j$ or $v_j$ is among the $m$-nearest neighbors of $v_i$.  Note here, that the notion of $m$-nearest neighbors is related with the similarity metric $s_{ij}$.  Hence, if we denote with $N_m(i)$ the set of $m$-nearest neighbors of $\overrightarrow{x_i}$, the adjacency matrix ${\bf A}$ of $G$ is defined as:
                                                  $a_{ij} = \begin{cases}
                                                  s_{ij}+\alpha & \textrm{ if $v_i \in N_m(j) \lor v_j \in N_m(i)$} \\
                                                  0 & \textrm{ otherwise} \\
                                                  \end{cases}$

where $\alpha$ is a small constant that prevents degenerate instances not having connections to anyone else.  Depending on the specifics of the problem the parameter $\alpha$ might have a physical interpretation and/or it can be set to be equal to 0.

Compute the graph Laplacian: Once we have computed the adjacency matrix of the similarity graph we compute the graph Laplacian $L$ as:  $L = D-A$, where $D$ is a diagonal matrix with:
                                                 $d_{ii} = \sum_{j} A_{ij}$
Usually we work with a normalized version, $L_n$, of the graph Laplacian given by:
                                                 $L_n = D^{-1/2}LD^{-1/2}$

Compute the spectrum of the graph Laplacian and apply the eigengap heuristic: At this point we will use the eigenvalues of the graph Laplacian to first identify the number of clusters $k$ in our dataset.  Based on the eigengap heuristic, we sort the eigenvalues $\lambda_i,~i\in \{1,2,...,n\}$, of $L_n$ in ascending order and we pick $k$ as:
                                                $k=argmax_{i} (\lambda_{i+1}-\lambda_i)$

Dimensionality reduction: Using the eigenvectors $\overrightarrow{v_i}$, that correspond to the $k$ largest eigenvalues of $L_n$ we create matrix $U=[\overrightarrow{v_1} \overrightarrow{v_2}...\overrightarrow{v_k}]\in \Re^{nxk}$.  We further normalize the rows of $U$ so as to have unit norm.

Clustering step: Treating each row $u_j$ of $U$ as a datapoint in $\Re^k$, cluster them in $k$ clusters, $C_1, C_2,...,C_k$, using k-means.  This induces a clustering $A_1, A_2,...,A_k$, on the original data: $A_i=\{j|u_j \in C_i\}$.

If you need more information on the above spectral clustering algorithm, you can read Ng's, Jordan's and Weiss' seminal paper.  Von Luxburg's tutorial on spectral clustering covers some variations of the above algorithm.


The MATLAB code

The spectral clustering algorithm will take the following inputs: (i) an $n {\tt x} \rho$ matrix ${\tt data}$,  that includes the $\rho$-dimensional data we want to cluster,  (ii) the number of minimum and maximum clusters that we would like to have (${\tt k_{min}}$ and ${\tt k_{max}}$), (iii) the number ${\tt m}$ of nearest neighbors that we will use for creating the similarity graph and (iv) the small constant ${\tt \alpha}$.  If we do not want to impose limitations to the eigengap heuristic we can set ${\tt k_{min}} = 0$ and ${\tt k_{max}} = n$.  The MATLAB function will return a vector ${\tt labels}$, which has the cluster label for the corresponding data point and a matrix ${\tt C}$, which stores the centroids of the corresponding clusters.

function [labels, C]=spectral_clustering(data,k_min,k_max, m, a)


First, we create the $n{\tt x} n$ adjacency matrix ${\tt A}$ for the similarity graph between the data points.  We call MATLAB function ${\tt knnsearch}$, to find the ${\tt m}$ nearest neighbors of each data point.  We use the option 'Distance', which specifies the distance metric to use for the calculation.  We have set it to the cosine similarity distance but one can use its own distance function as well.  Once we have identified the ${\tt m}$ nearest neighbors of a given data point (i.e., a vertex of the similarity  graph),  we create the entries of the affinity matrix using the similarity metric (in our case the cosine similarity and the constant ${\tt \alpha}$:

for i=1:1:length(data(:,1))
        idx=knnsearch(data,data(i,:),'K',m,'Distance','cosine');
        for j=1:1:m
            A(i,idx(j))=dot(data(i,:),data(idx(j),:))/(norm(data(i,:),2)*norm(data(idx(j),:),2));
            if((A(i,idx(j)) == 0) || (isnan(A(i,idx(j)))))
                A(i,idx(j)) = a;
            end
            A(idx(j),i)=A(i,idx(j));
        end
end

We then need to calculate the similarity graph Laplacian $L$:
D=zeros(length(data(:,1)));
for i=1:1:length(data(:,1))
    D(i,i)=sum(A(i,:));
end
L= D-A;
L=D^(-1/2)*L*D^(-1/2);

Next, we calculate the spectrum of the Laplacian and sort the eigenvalues in ascending order:
[U,V]= eig(L);
v=diag(V);
[vs, is] = sort(v,'ascend');

Having the eigenvalues sorted, we need to apply the eigengap heuristic. We will use the vector ${\tt eigengap}$ to store the difference in the values of two consecutive eigenvalues of ${\tt A}$. We ignore the eigenvalues below $k_{min}$ and above $k_{max}$, since we want the number of clusters to be in this range. In other words we restrict the eigengap heuristic in the range $[{\tt k_{min},k_{max}}]$:
eigengaps = zeros(length(vs)-1,1);
for i=1:1:length(eigengaps)
    if ((i k_max))
        eigengaps(i)=-1;
    else
        eigengaps(i)=vs(i+1)-vs(i);
    end
end
[junk k] = max(eigengaps); 

Now we need to create a $n {\tt x} k$ matrix ${\tt u}$ where each column is one of the $k$ first eigenvectors of ${\tt A}$. We further need to normalize each rows to unit norm:
u=[];
for i=1:1:k 
    u = [u U(:,is(i))];
end
for i=1:1:length(u(:,1))
    u(i,:)=u(i,:)/sqrt(sum(u(i,:).*u(i,:)));
end

Finally, we cluster using k-means the rows of the matrix ${\tt u}$. In principle you could use any other clustering algorithm you want. In our code, we use the MATLAB function ${\tt means}$ with the option 'Replicates' set. This options essentially runs k-means as many times as specified and returns the best solution. Note here that since we apply k-means on the eigenvectors of similarity matrix ${\tt A}$, the centroids we obtain (i.e., ${\tt C_{spec}}$) are not the centroids of the clusters of the actual data points. Hence, we need to compute them separately:
[labels, C_spec] = kmeans(u,k,'Replicates',200);
C=[];
for i=1:1:k
    IDX=find(labels == i);
    tmp=mean(data(IDX,:));
    C=[C; tmp];
end  

You can download the m-file here.