
Friday, February 1, 2013

Spectral clustering with eigengap heuristic: A MATLAB implementation

In this post I will present a step-by-step tutorial on a basic spectral clustering algorithm, along with a simple implementation in MATLAB.

The problem

The goal is, given a set of n data points, each a ρ-dimensional vector, to group the instances in such a way that points belonging to the same cluster are more similar to each other than to data points in other clusters.  There are many algorithms that can be used to complete this task, but here we will focus specifically on the spectral clustering approach.  The latter has become one of the most popular modern clustering approaches and has been shown to outperform more traditional clustering algorithms that are based on generative models (e.g., EM).  Such algorithms involve strong assumptions (e.g., that the data points in a cluster follow a multivariate Gaussian distribution), and the log likelihood can have many local maxima, so the optimization can get stuck in a bad clustering.  Intuitively, since, as we will see, spectral clustering works on the eigenvectors of a similarity matrix/network between the original data points, it can capture inherent patterns in the data that might not be captured otherwise.  In what follows we will present the steps followed by spectral clustering one by one.

Spectral clustering in a nutshell

Create a similarity graph between the data instances: Given our n data points, say x_1, x_2, ..., x_n, each a ρ-dimensional vector, we construct a graph G=(V,E).  Every vertex v_i of G is essentially one data point (e.g., x_i).  In order to define the set of edges E of this similarity graph, we essentially need to define a notion of similarity s_ij between the vertices (i.e., data points) v_i and v_j (i.e., x_i, x_j).  Then the edges of the graph can be constructed in a variety of ways.  Here we review the so-called m-nearest neighbor similarity graph.  The goal is to connect v_i with v_j if the latter is among the m nearest neighbors of v_i.  However, note that this can create directed graphs, since v_i and v_j might not simultaneously be among each other's m nearest neighbors.  To avoid this problem, we connect v_i and v_j with an edge if v_i is among the m nearest neighbors of v_j or v_j is among the m nearest neighbors of v_i.  Note here that the notion of m nearest neighbors is tied to the similarity metric s_ij.  Hence, if we denote with N_m(i) the set of m nearest neighbors of x_i, the adjacency matrix A of G is defined as:
                                                  a_{ij} = \begin{cases} s_{ij} + \alpha, & \text{if } v_i \in N_m(j) \text{ or } v_j \in N_m(i) \\ 0, & \text{otherwise} \end{cases}

where α is a small constant that prevents degenerate instances from being left without any connections.  Depending on the specifics of the problem, the parameter α might have a physical interpretation and/or it can be set equal to 0.

Compute the graph Laplacian: Once we have computed the adjacency matrix of the similarity graph, we compute the graph Laplacian L as L = D - A, where D is a diagonal matrix with:
                                                 d_{ii} = \sum_j A_{ij}
Usually we work with a normalized version, L_n, of the graph Laplacian, given by:
                                                 L_n = D^{-1/2} L D^{-1/2}
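
As a side note (standard spectral graph theory, not needed for the implementation but useful for the next step), the normalized Laplacian can be rewritten as
                                                 L_n = D^{-1/2}(D - A)D^{-1/2} = I - D^{-1/2} A D^{-1/2}
It is positive semidefinite, its smallest eigenvalue is 0 (assuming no isolated vertices, which the m-nearest neighbor construction and the constant α guard against), and for a graph with k connected components the eigenvalue 0 has multiplicity k.  This is what motivates looking at the smallest eigenvalues and their gaps below.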

Compute the spectrum of the graph Laplacian and apply the eigengap heuristic: At this point we will use the eigenvalues of the graph Laplacian to first identify the number of clusters k in our dataset.  Based on the eigengap heuristic, we sort the eigenvalues λ_1, λ_2, ..., λ_n of L_n in ascending order and we pick k as:
                                                k = \arg\max_i (\lambda_{i+1} - \lambda_i)
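As a quick illustration with made-up numbers: if the sorted eigenvalues were
                                                0, 0.02, 0.05, 0.71, 0.78, 0.85
the largest gap λ_{i+1} - λ_i occurs at i = 3, so the heuristic would suggest k = 3 clusters.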

Dimensionality reduction: Using the eigenvectors v_1, v_2, ..., v_k that correspond to the k smallest eigenvalues of L_n, we create the n × k matrix U = [v_1 v_2 ... v_k].  We further normalize the rows of U so as to have unit norm.

Clustering step: Treating each row u_j of U as a k-dimensional data point, we cluster the rows into k clusters, C_1, C_2, ..., C_k, using k-means.  This induces a clustering A_1, A_2, ..., A_k on the original data: A_i = \{ j \mid u_j \in C_i \}.

If you need more information on the above spectral clustering algorithm, you can read the seminal paper by Ng, Jordan, and Weiss.  Von Luxburg's tutorial on spectral clustering covers some variations of the above algorithm.


The MATLAB code

The spectral clustering algorithm will take the following inputs: (i) an n × ρ matrix data, which includes the ρ-dimensional data we want to cluster, (ii) the minimum and maximum number of clusters that we would like to have (k_min and k_max), (iii) the number m of nearest neighbors that we will use for creating the similarity graph, and (iv) the small constant α.  If we do not want to impose limits on the eigengap heuristic, we can set k_min = 0 and k_max = n.  The MATLAB function will return a vector labels, which holds the cluster label of each data point, and a matrix C, which stores the centroids of the corresponding clusters.

function [labels, C]=spectral_clustering(data,k_min,k_max, m, a)


First, we create the n × n adjacency matrix A of the similarity graph between the data points.  We call the MATLAB function knnsearch to find the m nearest neighbors of each data point.  We use the option 'Distance', which specifies the distance metric to use for the calculation.  We have set it to the cosine distance, but one can use their own distance function as well.  Once we have identified the m nearest neighbors of a given data point (i.e., a vertex of the similarity graph), we create the entries of the adjacency matrix using the similarity metric (in our case, the cosine similarity) and the constant α:

A = zeros(length(data(:,1)));               % preallocate the adjacency matrix
for i=1:1:length(data(:,1))
    % find the m nearest neighbors of data point i (cosine distance)
    idx = knnsearch(data, data(i,:), 'K', m, 'Distance', 'cosine');
    for j=1:1:m
        % edge weight = cosine similarity between the two points
        A(i,idx(j)) = dot(data(i,:), data(idx(j),:))/(norm(data(i,:),2)*norm(data(idx(j),:),2));
        if ((A(i,idx(j)) == 0) || (isnan(A(i,idx(j)))))
            A(i,idx(j)) = a;                % avoid degenerate (zero/NaN) similarities
        end
        A(idx(j),i) = A(i,idx(j));          % keep A symmetric (mutual m-NN graph)
    end
end

We then need to calculate the similarity graph Laplacian L:
D = zeros(length(data(:,1)));
for i=1:1:length(data(:,1))
    D(i,i) = sum(A(i,:));                   % degree of vertex i
end
L = D - A;                                  % unnormalized graph Laplacian
L = D^(-1/2)*L*D^(-1/2);                    % normalized graph Laplacian L_n

Next, we calculate the spectrum of the Laplacian and sort the eigenvalues in ascending order:
[U,V] = eig(L);                             % columns of U are eigenvectors, diag(V) the eigenvalues
v = diag(V);
[vs, is] = sort(v,'ascend');                % sorted eigenvalues and their original positions

Having the eigenvalues sorted, we need to apply the eigengap heuristic.  We will use the vector eigengaps to store the difference between two consecutive eigenvalues of L_n.  We ignore the positions below k_min and above k_max, since we want the number of clusters to be in this range.  In other words, we restrict the eigengap heuristic to the range [k_min, k_max]:
eigengaps = zeros(length(vs)-1,1);
for i=1:1:length(eigengaps)
    if ((i < k_min) || (i > k_max))
        eigengaps(i) = -1;                  % exclude positions outside [k_min, k_max]
    else
        eigengaps(i) = vs(i+1)-vs(i);
    end
end
[junk, k] = max(eigengaps);                 % k = position of the largest eigengap

Now we need to create an n × k matrix u where each column is one of the eigenvectors corresponding to the k smallest eigenvalues of L_n.  We further need to normalize each row to unit norm:
u = [];
for i=1:1:k
    u = [u U(:,is(i))];                     % eigenvectors of the k smallest eigenvalues
end
for i=1:1:length(u(:,1))
    u(i,:) = u(i,:)/sqrt(sum(u(i,:).*u(i,:)));   % normalize each row to unit norm
end

Finally, we cluster the rows of the matrix u using k-means.  In principle, you could use any other clustering algorithm you want.  In our code, we use the MATLAB function kmeans with the option 'Replicates' set.  This option essentially runs k-means as many times as specified and returns the best solution.  Note here that since we apply k-means on the eigenvectors of the graph Laplacian, the centroids we obtain (i.e., C_spec) are not the centroids of the clusters of the actual data points.  Hence, we need to compute them separately:
[labels, C_spec] = kmeans(u,k,'Replicates',200);
C = [];
for i=1:1:k
    IDX = find(labels == i);
    tmp = mean(data(IDX,:));                % centroid of cluster i in the original space
    C = [C; tmp];
end
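
To give a sense of how the function is used end to end, here is a minimal usage sketch.  It assumes the code above is saved as spectral_clustering.m and that the Statistics Toolbox (for knnsearch and kmeans) is available; the toy data and parameter values are purely illustrative:

% two toy 2-D clusters lying in different directions (the similarity graph uses cosine similarity)
data = [repmat([5 0],50,1) + 0.2*randn(50,2); ...
        repmat([0 5],50,1) + 0.2*randn(50,2)];
% look for 2 to 10 clusters, using a 15-nearest-neighbor graph and a small constant a
[labels, C] = spectral_clustering(data, 2, 10, 15, 1e-6);
scatter(data(:,1), data(:,2), 25, labels, 'filled');    % color the points by cluster label

With two well-separated directions like these, the eigengap heuristic should pick k = 2.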

You can download the m-file here.

Monday, March 12, 2012

How to implement Logistic Regression with stochastic gradient ascent in Java.



I will try in this post to give a step-by-step tutorial for implementing a simple logistic regression classifier from scratch in Java. Note that the provided code is not optimized; however, it works just fine.

The problem

The goal is to build a model for binary classification: given a vector of features \boldsymbol{x}^{(i)} we want to predict the corresponding label (y^{(i)} \in \{0,1\}). We assume that h_{\boldsymbol\alpha}(\boldsymbol{x}^{(i)}) is our prediction for instance i, which is given by the following equation:

h_{\boldsymbol{\alpha}}(\boldsymbol{x}) = \frac{1}{1+e^{-\boldsymbol{\alpha} \boldsymbol{x}}} ,

where \boldsymbol{\alpha} is a weight vector. The right-hand side of the previous equation is called the logistic function or the sigmoid function of \boldsymbol{\alpha}\boldsymbol{x} (see also here).

If we assume that
\Pr(y=1|\boldsymbol{x},\boldsymbol{\alpha}) = h_{\boldsymbol\alpha}(\boldsymbol{x}) ,

\Pr(y=0|\boldsymbol{x},\boldsymbol{\alpha}) = 1 - h_{\boldsymbol\alpha}(\boldsymbol{x}) ,

we can write \Pr(y|\boldsymbol{x},\boldsymbol{\alpha}) as follows:

\Pr(y|\boldsymbol{x},\boldsymbol{\alpha})=h_{\boldsymbol\alpha}(\boldsymbol{x})^y(1-h_{\boldsymbol\alpha}(\boldsymbol{x}))^{1-y} .

Now we can use MLE (see here) to estimate the optimal weights \boldsymbol\alpha. Assuming that we have m instances in our training set, the likelihood will be:

L(\boldsymbol\alpha) = \prod_{i=1}^m \Pr(y^{(i)}|\boldsymbol{x}^{(i)},\boldsymbol{\alpha}) ,

and by taking the logs we get:

\log L(\boldsymbol\alpha) = \sum_{i=1}^{m}y^{(i)} \log h(x^{(i)}) + (1 -y^{(i)}) \log (1-h(x^{(i)})) .

But how do we maximize this likelihood? There are many techniques available in the literature. In this post, I will use the simplest one, stochastic gradient ascent (ascent because we maximize; see also here). Considering only one example at a time, the gradient of the log likelihood with respect to each weight \alpha_j is:

\frac{\partial \log L}{\partial \alpha_j} = (y - h_{\boldsymbol{\alpha}}(\boldsymbol{x}))\, x_j .
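
For completeness, here is the short derivation of this gradient for a single instance; it uses only the fact that the sigmoid satisfies \partial h_{\boldsymbol\alpha}(\boldsymbol{x}) / \partial \alpha_j = h_{\boldsymbol\alpha}(\boldsymbol{x})(1 - h_{\boldsymbol\alpha}(\boldsymbol{x})) x_j :

\frac{\partial \log L}{\partial \alpha_j} = \Big( \frac{y}{h_{\boldsymbol\alpha}(\boldsymbol{x})} - \frac{1-y}{1-h_{\boldsymbol\alpha}(\boldsymbol{x})} \Big) h_{\boldsymbol\alpha}(\boldsymbol{x})(1 - h_{\boldsymbol\alpha}(\boldsymbol{x}))\, x_j = (y - h_{\boldsymbol\alpha}(\boldsymbol{x}))\, x_j .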

Now that we have the gradient, the update rule for each weight will be:

\alpha_j := \alpha_j + \lambda (y^{(i)} - h_{\boldsymbol\alpha}(\boldsymbol{x}^{(i)}))x_j^{(i)} .
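
As a tiny numerical illustration of one update (the numbers are made up): suppose the current weights are \alpha_0 = \alpha_1 = 0 (intercept and one feature weight), the instance has x_1 = 2 with label y = 1, and \lambda = 0.1. Then h_{\boldsymbol\alpha}(\boldsymbol{x}) = 1/(1+e^{0}) = 0.5, so the updates are \alpha_0 := 0 + 0.1 \cdot (1 - 0.5) = 0.05 and \alpha_1 := 0 + 0.1 \cdot (1 - 0.5) \cdot 2 = 0.1 . This matches what the adjustWeights method below does, with alpha[0] playing the role of the intercept.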

If you need more information about this procedure, see Andrew Ng's homepage.

The Java Code

In the code below I implemented what I just described. The Logistic class contains all the necessary methods. This is how you can call it:
/* Call the Logistic constructor: coeffs holds the initial coefficients (set to zero)
   and l is the ArrayList<String> with the raw data rows. */
Logistic logistic = new Logistic(coeffs, 0.0000001, l, 10000, 0.00001);
logistic.classifyByInstance();
Further, I am using the following holder class for keeping an instance:
public class BinaryInstance {

    private boolean label;
    private double[] x;

    public BinaryInstance() {
    }

    public boolean getLabel() {
        return label;
    }

    public void setLabel(boolean label) {
        this.label = label;
    }

    public double[] getX() {
        return x;
    }

    public void setX(double[] x) {
        this.x = x;
    }
}



Next, I am calling a method from a class "Utils" to transform the raw data into "BinaryInstance" objects:

/* In this specific example, "ad" and "nonad" are the labels. */

public BinaryInstance rowDataToInstance(String instance) {
    BinaryInstance inst = new BinaryInstance();
    boolean labelAd = instance.endsWith(",ad.");
    instance = instance.replace(",ad.", "");
    instance = instance.replace(",nonad.", "");
    inst.setX(getCoeffs(instance));
    inst.setLabel(labelAd);
    return inst;
}

public double[] getCoeffs(String coeffs) {
    coeffs = coeffs.replaceAll("\\[", "");
    coeffs = coeffs.replaceAll("\\]", "");
    String[] tmpAr = coeffs.split(",");
    double[] result = new double[tmpAr.length];
    for (int i = 0; i < tmpAr.length; i++)
        result[i] = Double.parseDouble(tmpAr[i].trim());
    return result;
}


Finally, the Logistic class is given below:

package Classifiers;

import java.util.ArrayList;
import java.util.Collections;
// BinaryInstance and Utils are assumed to live in the same package

public class Logistic {

    private double[] alpha;
    private double lambda;
    private Utils u;
    private double logLikelihood;
    private double prevLogLikelihood;
    private ArrayList<BinaryInstance> allData;
    private int datasetIterations;
    private int maxDatasetIterations;
    private double tolerance;

    /**
     * @param alpha : weights
     * @param lambda : learning rate
     * @param data : dataset (one raw instance per string)
     * @param maxDatasetIterations
     * @param tolerance
     */
    public Logistic(double[] alpha, double lambda, ArrayList<String> data,
            int maxDatasetIterations, double tolerance) {
        this.alpha = alpha;
        this.lambda = lambda;
        prevLogLikelihood = Double.POSITIVE_INFINITY;
        logLikelihood = 0;
        allData = new ArrayList<BinaryInstance>();
        u = new Utils();
        for (String row : data) {
            BinaryInstance instance = u.rowDataToInstance(row);
            allData.add(instance);
        }
        datasetIterations = 0;
        this.maxDatasetIterations = maxDatasetIterations;
        this.tolerance = tolerance;
    }

    public void classifyByInstance() {

        while (evaluateCondition()) {
            prevLogLikelihood = logLikelihood;
            datasetIterations++;
            Collections.shuffle(allData);
            for (BinaryInstance instance : allData) {
                double probPositive = estimateProbs(instance.getX());
                double label = instance.getLabel() ? 1 : 0;
                // stochastic gradient ascent: update weights using this single instance
                adjustWeights(instance.getX(), probPositive, label);
            }
            logLikelihood = calculateLogL(allData);
        }
    }

    private boolean evaluateCondition() {
        // keep iterating while the log-likelihood still changes and iterations remain
        return (Math.abs(logLikelihood - prevLogLikelihood) > tolerance
                && datasetIterations < maxDatasetIterations);
    }

    private double estimateProbs(double[] x) {
        // sigmoid of the linear combination; alpha[0] is the intercept
        double sum = alpha[0];
        for (int i = 1; i < this.alpha.length; i++)
            sum += this.alpha[i] * x[i - 1];
        double exponent = Math.exp(-sum);
        double probPositive = 1 / (1 + exponent);
        // clamp to avoid log(0) in the log-likelihood
        if (probPositive == 0)
            probPositive = 0.00001;
        else if (probPositive == 1)
            probPositive = 0.9999;
        return probPositive;
    }

    private void adjustWeights(double[] x, double probPositive, double label) {
        // intercept term
        this.alpha[0] += this.lambda * (label - probPositive);
        for (int i = 1; i < this.alpha.length; i++) {
            this.alpha[i] += this.lambda * x[i - 1] * (label - probPositive);
        }
    }

    private double calculateLogL(ArrayList<BinaryInstance> allData) {

        double logl = 0;
        for (BinaryInstance instance : allData) {
            double probPositive = estimateProbs(instance.getX());
            double label = instance.getLabel() ? 1 : 0;
            double probNegative = 1 - probPositive;
            // contribution of this instance to the log-likelihood
            double tmp = label * Math.log(probPositive) + (1 - label) * Math.log(probNegative);
            logl += tmp;
        }
        return logl;
    }
}