In this post I will present a step-by-step tutorial on a basic spectral clustering algorithm and a simple implementation in MATLAB.
The problem
The goal is, given a set of $\rho$-dimensional data points (say $n$ in total), to group the instances in such a way that points belonging to the same cluster are more similar to each other than to points in other clusters. There are many algorithms that can be used for this task, but here we will focus specifically on the spectral clustering approach. The latter has become one of the most popular modern clustering approaches and has been shown to outperform more traditional clustering algorithms based on generative models (e.g., mixture models fitted with EM). Such algorithms rely on strong assumptions (e.g., that the data points in a cluster follow a multivariate Gaussian distribution), and the log-likelihood can have many local optima, resulting in a bad clustering. Intuitively, since spectral clustering works on the eigenvectors of a similarity matrix/graph built from the original data points, it can capture inherent structure in the data that these methods might miss. In what follows we present the steps of spectral clustering one by one.
Spectral clustering in a nutshell
Create a similarity graph between the data instances: Given our $n$ data points, say $\vec{x}_1, \vec{x}_2, \dots, \vec{x}_n \in \mathbb{R}^{\rho}$, we construct a graph $G=(V,E)$. Every vertex $v_i \in V$ corresponds to one data point $\vec{x}_i$. In order to define the set of edges $E$ of this similarity graph, we first need a notion of similarity $s_{ij}$ between the vertices (i.e., data points) $v_i$ and $v_j$ (i.e., $\vec{x}_i$, $\vec{x}_j$). The edges of the graph can then be constructed in a variety of ways. Here we review the so-called m-nearest-neighbor similarity graph. The idea is to connect $v_i$ with $v_j$ if the latter is among the $m$ nearest neighbors of $v_i$. However, this can create a directed graph, since $v_i$ and $v_j$ might not simultaneously be among each other's $m$ nearest neighbors. To avoid this problem, we connect $v_i$ and $v_j$ with an edge if $v_i$ is among the $m$ nearest neighbors of $v_j$ or $v_j$ is among the $m$ nearest neighbors of $v_i$. Note that the notion of $m$ nearest neighbors is defined with respect to the similarity metric $s_{ij}$. Hence, if we denote by $N_m(i)$ the set of $m$ nearest neighbors of $\vec{x}_i$, the adjacency matrix $A$ of $G$ is defined as:
$$ a_{ij} = \begin{cases} s_{ij} + \alpha & \text{if } v_i \in N_m(j) \,\vee\, v_j \in N_m(i) \\ 0 & \text{otherwise} \end{cases} $$
where $\alpha$ is a small constant that prevents degenerate instances from being left without any connection to the rest of the graph. Depending on the specifics of the problem, the parameter $\alpha$ might have a physical interpretation and/or it can simply be set to 0.
Compute the graph Laplacian: Once we have computed the adjacency matrix of the similarity graph, we compute the graph Laplacian as $L = D - A$, where $D$ is a diagonal matrix with:
$$ d_{ii} = \sum_j A_{ij} $$
Usually we work with a normalized version $L_n$ of the graph Laplacian, given by:
$$ L_n = D^{-1/2} L D^{-1/2} $$
Compute the spectrum of the graph Laplacian and apply the eigengap heuristic: At this point we use the eigenvalues of the graph Laplacian to identify the number of clusters $k$ in our dataset. Based on the eigengap heuristic, we sort the eigenvalues $\lambda_i$, $i \in \{1, 2, \dots, n\}$, of $L_n$ in ascending order and pick $k$ as follows (a small numeric example is given right after the steps):
$$ k = \arg\max_i \left( \lambda_{i+1} - \lambda_i \right) $$
Dimensionality reduction: Using the eigenvectors $\vec{v}_i$ that correspond to the $k$ smallest eigenvalues of $L_n$, we create the matrix $U = [\vec{v}_1 \ \vec{v}_2 \ \dots \ \vec{v}_k] \in \mathbb{R}^{n \times k}$. We further normalize the rows of $U$ to have unit norm.
Clustering step: Treating each row $u_j$ of $U$ as a data point in $\mathbb{R}^{k}$, we cluster them into $k$ clusters $C_1, C_2, \dots, C_k$ using k-means. This induces a clustering $A_1, A_2, \dots, A_k$ on the original data points: $A_i = \{ j \mid u_j \in C_i \}$.
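To make the eigengap heuristic concrete, here is a small numeric illustration in MATLAB (the eigenvalues below are made up for demonstration and do not come from any particular dataset):

% Hypothetical sorted eigenvalues of a normalized Laplacian: the first three
% are close to zero and the fourth jumps up, hinting at three clusters.
lambda = [0.00 0.01 0.03 0.85 0.90 1.10];
gaps   = diff(lambda);      % consecutive differences lambda(i+1) - lambda(i)
[~, k] = max(gaps)          % k = 3, the position of the largest gap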
If you need more information on the above spectral clustering algorithm, you can read the seminal paper by Ng, Jordan, and Weiss. Von Luxburg's tutorial on spectral clustering covers several variations of the above algorithm.
The MATLAB code
The spectral clustering algorithm will take the following inputs: (i) an $n \times \rho$ matrix data that contains the $\rho$-dimensional data points we want to cluster, (ii) the minimum and maximum number of clusters we would like to obtain (k_min and k_max), (iii) the number m of nearest neighbors that we will use for creating the similarity graph, and (iv) the small constant α (denoted a in the code). If we do not want to restrict the eigengap heuristic, we can set k_min=0 and k_max=n. The MATLAB function returns a vector labels, which holds the cluster label of each data point, and a matrix C, which stores the centroids of the corresponding clusters.
function [labels, C] = spectral_clustering(data, k_min, k_max, m, a)
First, we create the $n \times n$ adjacency matrix A of the similarity graph between the data points. We call the MATLAB function knnsearch to find the m nearest neighbors of each data point. We use the option 'Distance', which specifies the distance metric to use; we have set it to the cosine distance, but one can plug in a different distance function as well. Once we have identified the m nearest neighbors of a given data point (i.e., a vertex of the similarity graph), we fill in the entries of the affinity matrix using the similarity metric (in our case the cosine similarity) and the constant α:
A = zeros(length(data(:,1)));                  % preallocate the n x n affinity matrix
for i=1:1:length(data(:,1))
    % m nearest neighbors of point i under the cosine distance
    idx = knnsearch(data, data(i,:), 'K', m, 'Distance', 'cosine');
    for j=1:1:m
        % cosine similarity between point i and its j-th nearest neighbor
        A(i,idx(j)) = dot(data(i,:), data(idx(j),:)) / (norm(data(i,:),2)*norm(data(idx(j),:),2));
        if ((A(i,idx(j)) == 0) || (isnan(A(i,idx(j)))))
            A(i,idx(j)) = a;                   % replace degenerate zero/NaN similarities with a
        end
        A(idx(j),i) = A(i,idx(j));             % keep the matrix symmetric (undirected graph)
    end
end
We then calculate the (normalized) graph Laplacian L:
D = zeros(length(data(:,1)));
for i=1:1:length(data(:,1))
    D(i,i) = sum(A(i,:));      % degree of vertex i
end
L = D - A;                     % unnormalized Laplacian
L = D^(-1/2)*L*D^(-1/2);       % normalized Laplacian
Next, we calculate the spectrum of the Laplacian and sort the eigenvalues in ascending order:
[U,V] = eig(L);
v = diag(V);
[vs, is] = sort(v,'ascend');
Having the eigenvalues sorted, we apply the eigengap heuristic. We use the vector eigengaps to store the differences between consecutive eigenvalues of L. We ignore the gaps at positions below k_min and above k_max, since we want the number of clusters to fall in this range. In other words, we restrict the eigengap heuristic to the range [k_min, k_max]:
eigengaps = zeros(length(vs)-1,1);
for i=1:1:length(eigengaps)
    if ((i < k_min) || (i > k_max))
        eigengaps(i) = -1;                 % exclude gaps outside [k_min, k_max]
    else
        eigengaps(i) = vs(i+1) - vs(i);    % gap between consecutive eigenvalues
    end
end
[~, k] = max(eigengaps);                   % index of the largest gap gives the number of clusters
Now we create an $n \times k$ matrix u whose columns are the eigenvectors of L corresponding to the k smallest eigenvalues. We further normalize each row to unit norm:
u = [];
for i=1:1:k
    u = [u U(:,is(i))];
end
for i=1:1:length(u(:,1))
    u(i,:) = u(i,:)/sqrt(sum(u(i,:).*u(i,:)));
end
Finally, we cluster the rows of the matrix u using k-means. In principle you could use any other clustering algorithm here. In our code, we use the MATLAB function kmeans with the option 'Replicates' set. This option essentially runs k-means as many times as specified and returns the best solution. Note that since we apply k-means to the rows of the eigenvector matrix u, the centroids we obtain (i.e., C_spec) are not the centroids of the clusters of the actual data points. Hence, we need to compute the latter separately:
[labels, C_spec] = kmeans(u,k,'Replicates',200);
C = [];
for i=1:1:k
    IDX = find(labels == i);
    tmp = mean(data(IDX,:));
    C = [C; tmp];
end
You can download the m-file here.
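To try the function end to end, here is a minimal usage sketch on synthetic data (the data, parameter values, and random seed below are made up for illustration and are not part of the original post):

% Two hypothetical blobs lying along different directions, so that the
% cosine similarity used inside spectral_clustering can separate them.
rng(1);                                     % fix the random seed for reproducibility
blob1 = [6 + randn(50,1), randn(50,1)];     % points roughly along the x-axis
blob2 = [randn(50,1), 6 + randn(50,1)];     % points roughly along the y-axis
data  = [blob1; blob2];

% Allow between 2 and 5 clusters, use 10 nearest neighbors, and a small a.
[labels, C] = spectral_clustering(data, 2, 5, 10, 1e-6);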