Friday, February 1, 2013

Spectral clustering with eigengap heuristic: A MATLAB implementation

In this post I will present a step-by-step tutorial on a basic spectral clustering algorithm and a simple implementation of it in MATLAB.

The problem

The goal is, given a set of $\rho$-dimensional data points (say $n$ in total), to group the instances in such a way that points belonging to the same cluster are more similar to each other than to data points in other clusters.  There are many algorithms that can be used to complete this task, but here we will focus specifically on the spectral clustering approach.  The latter has become one of the most popular modern clustering approaches and has been shown to outperform more traditional clustering algorithms that are based on generative models (e.g., EM).  Such algorithms involve strong assumptions (e.g., that the data points in a cluster follow a multivariate Gaussian distribution) and their log likelihood can have many local optima, which can result in a bad clustering.  Intuitively, since, as we will see, spectral clustering works on the eigenvectors of a similarity matrix/network between the original data points, it can capture inherent patterns in the data that might otherwise be missed.  In what follows we will present the steps of spectral clustering one by one.

Spectral clustering in a nutshell

Create a similarity graph between the data instances: Given our $n$ data points, say $\overrightarrow{x_1},~\overrightarrow{x_2},...,\overrightarrow{x_n} \in \Re^{\rho}$, we construct a graph $G=(V,E)$.  Every vertex $v_i\in V$ corresponds to one data point (i.e., $\overrightarrow{x_i}$).  In order to define the set of edges $E$ of this similarity graph, we first need to define a notion of similarity $s_{ij}$ between the vertices (i.e., data points) $v_i$ and $v_j$ (i.e., $\overrightarrow{x_i}$, $\overrightarrow{x_j}$).  Then the edges of the graph can be constructed in a variety of ways.  Here we review the so-called $m$-nearest neighbor similarity graph.  The idea is to connect $v_i$ with $v_j$ if the latter is among the $m$-nearest neighbors of $v_i$.  However, note that this can create a directed graph, since $v_i$ and $v_j$ might not simultaneously be among each other's $m$-nearest neighbors.  To avoid this problem, we connect $v_i$ and $v_j$ with an edge if $v_i$ is among the $m$-nearest neighbors of $v_j$ or $v_j$ is among the $m$-nearest neighbors of $v_i$.  Note that the notion of $m$-nearest neighbors is defined with respect to the similarity metric $s_{ij}$.  Hence, if we denote with $N_m(i)$ the set of $m$-nearest neighbors of $\overrightarrow{x_i}$, the adjacency matrix ${\bf A}$ of $G$ is defined as:
                                                  $a_{ij} = \begin{cases}
                                                  s_{ij}+\alpha & \textrm{ if $v_i \in N_m(j) \lor v_j \in N_m(i)$} \\
                                                  0 & \textrm{ otherwise} \\
                                                  \end{cases}$

where $\alpha$ is a small constant that prevents degenerate cases in which an instance ends up with no connection to any other point.  Depending on the specifics of the problem, the parameter $\alpha$ might have a physical interpretation and/or it can simply be set to 0.
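
For concreteness, the MATLAB implementation below uses the cosine similarity, $s_{ij} = \frac{\overrightarrow{x_i} \cdot \overrightarrow{x_j}}{\|\overrightarrow{x_i}\|_2 \|\overrightarrow{x_j}\|_2}$, but any other similarity metric can be plugged in.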

Compute the graph Laplacian: Once we have computed the adjacency matrix of the similarity graph we compute the graph Laplacian $L$ as:  $L = D-A$, where $D$ is a diagonal matrix with:
                                                 $d_{ii} = \sum_{j} A_{ij}$
Usually we work with a normalized version, $L_n$, of the graph Laplacian given by:
                                                 $L_n = D^{-1/2}LD^{-1/2}$
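
As a small sanity check of these definitions, consider a path graph on three vertices with edges $\{v_1,v_2\}$ and $\{v_2,v_3\}$, both with weight 1.  Then:
                    $A=\begin{pmatrix}0&1&0\\1&0&1\\0&1&0\end{pmatrix},~D=\begin{pmatrix}1&0&0\\0&2&0\\0&0&1\end{pmatrix},~L=D-A=\begin{pmatrix}1&-1&0\\-1&2&-1\\0&-1&1\end{pmatrix},~L_n=\begin{pmatrix}1&-\frac{1}{\sqrt{2}}&0\\-\frac{1}{\sqrt{2}}&1&-\frac{1}{\sqrt{2}}\\0&-\frac{1}{\sqrt{2}}&1\end{pmatrix}$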

Compute the spectrum of the graph Laplacian and apply the eigengap heuristic: At this point we will use the eigenvalues of the graph Laplacian to first identify the number of clusters $k$ in our dataset.  Based on the eigengap heuristic, we sort the eigenvalues $\lambda_i,~i\in \{1,2,...,n\}$, of $L_n$ in ascending order and we pick $k$ as:
                                                $k=\arg\max_{i} (\lambda_{i+1}-\lambda_i)$
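
For instance, if the sorted eigenvalues are $0,~0.02,~0.04,~0.92,~1.1$, the largest gap is $\lambda_4-\lambda_3=0.88$, so the heuristic suggests $k=3$ clusters.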

Dimensionality reduction: Using the eigenvectors $\overrightarrow{v_i}$ that correspond to the $k$ smallest eigenvalues of $L_n$, we create the matrix $U=[\overrightarrow{v_1}~\overrightarrow{v_2}...\overrightarrow{v_k}]\in \Re^{n\times k}$.  We further normalize the rows of $U$ so that they have unit norm.

Clustering step: Treating each row $u_j$ of $U$ as a data point in $\Re^k$, we cluster the rows into $k$ clusters, $C_1, C_2,...,C_k$, using k-means.  This induces a clustering $A_1, A_2,...,A_k$ on the original data points: $A_i=\{j~|~u_j \in C_i\}$.

If you need more information on the above spectral clustering algorithm, you can read the seminal paper by Ng, Jordan and Weiss.  Von Luxburg's tutorial on spectral clustering covers several variations of the above algorithm.


The MATLAB code

The spectral clustering algorithm will take the following inputs: (i) an $n \times \rho$ matrix ${\tt data}$ that includes the $\rho$-dimensional data points we want to cluster, (ii) the minimum and maximum number of clusters that we would like to have (${\tt k_{min}}$ and ${\tt k_{max}}$), (iii) the number ${\tt m}$ of nearest neighbors that we will use for creating the similarity graph and (iv) the small constant ${\tt \alpha}$.  If we do not want to impose limitations on the eigengap heuristic we can set ${\tt k_{min}} = 0$ and ${\tt k_{max}} = n$.  The MATLAB function will return a vector ${\tt labels}$, which holds the cluster label of each data point, and a matrix ${\tt C}$, which stores the centroids of the corresponding clusters.

function [labels, C]=spectral_clustering(data,k_min,k_max, m, a)


First, we create the $n \times n$ adjacency matrix ${\tt A}$ for the similarity graph between the data points.  We call the MATLAB function ${\tt knnsearch}$ to find the ${\tt m}$ nearest neighbors of each data point.  We use the option 'Distance', which specifies the distance metric to use for the calculation.  We have set it to the cosine distance, but one can use their own distance function as well.  Once we have identified the ${\tt m}$ nearest neighbors of a given data point (i.e., a vertex of the similarity graph), we create the entries of the affinity matrix using the similarity metric (in our case the cosine similarity) and the constant ${\tt \alpha}$:

A = zeros(length(data(:,1)));   % preallocate the n x n affinity matrix
for i=1:1:length(data(:,1))
        % find the m nearest neighbors of data point i (cosine distance)
        idx=knnsearch(data,data(i,:),'K',m,'Distance','cosine');
        for j=1:1:m
            % cosine similarity between point i and its j-th nearest neighbor
            A(i,idx(j))=dot(data(i,:),data(idx(j),:))/(norm(data(i,:),2)*norm(data(idx(j),:),2));
            % replace zero/undefined similarities with the small constant a
            if((A(i,idx(j)) == 0) || (isnan(A(i,idx(j)))))
                A(i,idx(j)) = a;
            end
            % keep the adjacency matrix symmetric
            A(idx(j),i)=A(i,idx(j));
        end
end

We then need to calculate the Laplacian $L$ of the similarity graph and normalize it:
D=zeros(length(data(:,1)));   % diagonal degree matrix
for i=1:1:length(data(:,1))
    D(i,i)=sum(A(i,:));
end
L= D-A;                  % unnormalized graph Laplacian
L=D^(-1/2)*L*D^(-1/2);   % normalized Laplacian L_n

Next, we calculate the spectrum of the Laplacian and sort the eigenvalues in ascending order:
[U,V]= eig(L);
v=diag(V);
[vs, is] = sort(v,'ascend');

Having the eigenvalues sorted, we need to apply the eigengap heuristic. We will use the vector ${\tt eigengaps}$ to store the differences between consecutive eigenvalues of the normalized Laplacian. We ignore the gaps at indices below ${\tt k_{min}}$ or above ${\tt k_{max}}$, since we want the number of clusters to fall in this range. In other words, we restrict the eigengap heuristic to the range $[{\tt k_{min}},{\tt k_{max}}]$:
eigengaps = zeros(length(vs)-1,1);
for i=1:1:length(eigengaps)
    if ((i < k_min) || (i > k_max))
        eigengaps(i)=-1;
    else
        eigengaps(i)=vs(i+1)-vs(i);
    end
end
[~, k] = max(eigengaps);   % index of the largest eigengap gives the number of clusters

Now we need to create an $n \times k$ matrix ${\tt u}$ whose columns are the eigenvectors corresponding to the $k$ smallest eigenvalues of the normalized Laplacian. We further need to normalize each row to unit norm:
u = U(:,is(1:k));   % eigenvectors of the k smallest eigenvalues
for i=1:1:length(u(:,1))
    u(i,:)=u(i,:)/sqrt(sum(u(i,:).*u(i,:)));   % normalize each row to unit norm
end

Finally, we cluster the rows of the matrix ${\tt u}$ using k-means. In principle you could use any other clustering algorithm you want. In our code, we use the MATLAB function ${\tt kmeans}$ with the option 'Replicates' set. This option essentially runs k-means as many times as specified and returns the best solution. Note here that since we apply k-means on the eigenvectors of the graph Laplacian, the centroids we obtain (i.e., ${\tt C_{spec}}$) are not the centroids of the clusters of the actual data points. Hence, we need to compute the latter separately:
[labels, C_spec] = kmeans(u,k,'Replicates',200);
C=[];
for i=1:1:k
    IDX=find(labels == i);
    tmp=mean(data(IDX,:));
    C=[C; tmp];
end  
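
As a quick sanity check, here is one (hypothetical) way to call the function on a toy dataset of two well-separated 2-D blobs; the data and parameter values below are arbitrary placeholders and not tuned:

% Hypothetical toy data: two 2-D blobs centered at (10,0) and (0,10)
data = [0.5*randn(50,2) + repmat([10 0],50,1); ...
        0.5*randn(50,2) + repmat([0 10],50,1)];
% k_min = 2, k_max = 10, m = 10 nearest neighbors, a = 1e-6
[labels, C] = spectral_clustering(data, 2, 10, 10, 1e-6);
% labels(i) is the cluster assignment of data(i,:); C holds the cluster centroids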

You can download the m-file here.

Monday, March 12, 2012

How to implement Logistic Regression with stochastic gradient ascent in Java.



In this post I will try to give a step-by-step tutorial for implementing a simple logistic regression classifier from scratch in Java. Note that the provided code is not optimized; however, it works just fine.

The problem

The goal is to build a model for binary classification: given a vector of features $\boldsymbol{x}^{(i)}$ we want to predict the corresponding label ($y^{(i)} \in \{0,1\}$). We assume that $h_{\boldsymbol\alpha}(\boldsymbol{x}^{(i)})$ is our prediction for instance $i$, which is given by the following equation:

$h_{\boldsymbol{\alpha}}(\boldsymbol{x}) = \frac{1}{1+e^{-\boldsymbol{\alpha} \boldsymbol{x}}}$ ,

where $\boldsymbol{\alpha}$ is a weight vector. The right-hand side of the previous equation is called the logistic function or the sigmoid function of $\boldsymbol{\alpha}\boldsymbol{x}$ (see also here).

If we assume that
$\Pr(y=1|\boldsymbol{x},\boldsymbol{\alpha}) = h_{\boldsymbol\alpha}(\boldsymbol{x})$ ,

$\Pr(y=0|\boldsymbol{x},\boldsymbol{\alpha}) = 1 - h_{\boldsymbol\alpha}(\boldsymbol{x})$ ,

we can write $\Pr(y|\boldsymbol{x},\boldsymbol{\alpha})$ as follows:

$\Pr(y|\boldsymbol{x},\boldsymbol{\alpha})=h_{\boldsymbol\alpha}(\boldsymbol{x})^y(1-h_{\boldsymbol\alpha}(\boldsymbol{x}))^{1-y}$ .
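
As a quick check, plugging in $y=1$ recovers $h_{\boldsymbol\alpha}(\boldsymbol{x})$ and $y=0$ recovers $1-h_{\boldsymbol\alpha}(\boldsymbol{x})$, so this single expression covers both cases.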

Now we can use MLE (see here) to estimate the optimal weights $\boldsymbol\alpha$. Assuming that we have $m$ instances in our training set, the likelihood will be:

$L(\boldsymbol\alpha) = \prod_{i=1}^m \Pr(y^{(i)}|\boldsymbol{x}^{(i)},\boldsymbol{\alpha})$ ,

and by taking the logs we get:

$\log L(\boldsymbol\alpha) = \sum_{i=1}^{m}\left[ y^{(i)} \log h_{\boldsymbol\alpha}(\boldsymbol{x}^{(i)}) + (1 -y^{(i)}) \log (1-h_{\boldsymbol\alpha}(\boldsymbol{x}^{(i)}))\right]$ .

But how do we maximize this likelihood? There are many techniques available in the literature. In this post, I will use the simplest one, stochastic gradient ascent (ascent because we maximize; see also here). Considering only one training example $(\boldsymbol{x}, y)$ at a time, the gradient of its contribution to the log likelihood with respect to dimension $j$ is:

$\frac{\partial \log L}{\partial \alpha_j} = (y - h_{\boldsymbol{\alpha}}(\boldsymbol{x}))\,x_j$ .
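
This follows from the chain rule and the fact that the derivative of the sigmoid satisfies $\frac{\partial h_{\boldsymbol\alpha}(\boldsymbol{x})}{\partial \alpha_j} = h_{\boldsymbol\alpha}(\boldsymbol{x})(1-h_{\boldsymbol\alpha}(\boldsymbol{x}))\,x_j$, which gives

$\frac{\partial}{\partial \alpha_j}\Big[ y \log h_{\boldsymbol\alpha}(\boldsymbol{x}) + (1-y)\log(1-h_{\boldsymbol\alpha}(\boldsymbol{x}))\Big] = y(1-h_{\boldsymbol\alpha}(\boldsymbol{x}))x_j - (1-y)h_{\boldsymbol\alpha}(\boldsymbol{x})x_j = (y - h_{\boldsymbol\alpha}(\boldsymbol{x}))x_j$ .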

Given this gradient, the update rule for each dimension will be:

$\alpha_j := \alpha_j + \lambda (y^{(i)} - h_{\boldsymbol\alpha}(\boldsymbol{x}^{(i)}))x_j^{(i)}$ .

If you need more information about this procedure, see Andrew Ng's homepage.

The Java Code

In the code below I implemented what I just described. The Logistic class contains all the necessary methods. This is how you can call it:
/* Call the Logistic constructor (as defined below): coeffs holds the
   initial coefficients (set to zero), followed by the learning rate,
   the raw data rows (an ArrayList<String>, here called l), the maximum
   number of passes over the dataset and the convergence tolerance. */
Logistic logistic = new Logistic(coeffs, 0.0000001, l, 10000, 0.00001);
logistic.classifyByInstance();
Further, I am using the following holder class for keeping an instance:
public class BinaryInstance {

    private boolean label;
    private double[] x;

    public BinaryInstance() {
    }

    public boolean getLabel() {
        return label;
    }

    public void setLabel(boolean label) {
        this.label = label;
    }

    public double[] getX() {
        return x;
    }

    public void setX(double[] x) {
        this.x = x;
    }
}



Next, I am calling a method from a class "Utils" to transform the raw data into "BinaryInstance" objects:

/* In this specific example, "ad" and "nonad" are the labels. */

public BinaryInstance rowDataToInstance(String instance) {
    BinaryInstance inst = new BinaryInstance();
    Boolean labelAd = instance.endsWith(",ad.");
    instance = instance.replace(",ad.", "");
    instance = instance.replace(",nonad.", "");
    inst.setX(getCoeffs(instance));
    inst.setLabel(labelAd);
    return inst;
}

public double[] getCoeffs(String coeffs) {
    coeffs = coeffs.replaceAll("\\[", "");
    coeffs = coeffs.replaceAll("\\]", "");
    String[] tmpAr = coeffs.split(",");
    double[] result = new double[tmpAr.length];
    for (int i = 0; i < tmpAr.length; i++)
        result[i] = Double.parseDouble(tmpAr[i].trim());
    return result;
}


Finally, the Logistic class is given below:

package Classifiers;

import java.util.ArrayList;
import java.util.Collections;
// BinaryInstance and Utils are assumed to live in the same package (Classifiers)

public class Logistic {

    private double[] alpha;
    private double lambda;
    private Utils u;
    private double logLikelihood;
    private double prevLogLikelihood;
    private ArrayList<BinaryInstance> allData;
    private int datasetIterations;
    private int maxDatasetIterations;
    private double tolerance;

    /**
     * @param alpha : weights
     * @param lambda : learning rate
     * @param data : dataset (one raw string per instance)
     * @param maxDatasetIterations : maximum number of passes over the dataset
     * @param tolerance : convergence tolerance on the log likelihood
     */
    public Logistic(double[] alpha, double lambda, ArrayList<String> data,
            int maxDatasetIterations, double tolerance) {
        this.alpha = alpha;
        this.lambda = lambda;
        prevLogLikelihood = Double.POSITIVE_INFINITY;
        logLikelihood = 0;
        allData = new ArrayList<BinaryInstance>();
        u = new Utils();
        for (String row : data) {
            BinaryInstance instance = u.rowDataToInstance(row);
            allData.add(instance);
        }
        datasetIterations = 0;
        this.maxDatasetIterations = maxDatasetIterations;
        this.tolerance = tolerance;
    }

    public void classifyByInstance() {

        while (evaluateCondition()) {
            prevLogLikelihood = logLikelihood;
            datasetIterations++;
            // Shuffle the data before each pass (stochastic gradient ascent)
            Collections.shuffle(allData);
            for (BinaryInstance instance : allData) {
                double probPositive = estimateProbs(instance.getX());
                double label = instance.getLabel() ? 1 : 0;
                adjustWeights(instance.getX(), probPositive, label);
            }
            logLikelihood = calculateLogL(allData);
        }
    }

    private boolean evaluateCondition() {
        return (Math.abs(logLikelihood - prevLogLikelihood) > tolerance
                && datasetIterations < maxDatasetIterations);
    }

    private double estimateProbs(double[] x) {
        // Sigmoid of the weighted sum; alpha[0] is the intercept.
        double sum = alpha[0];
        for (int i = 1; i < this.alpha.length; i++)
            sum += this.alpha[i] * x[i - 1];
        double exponent = Math.exp(-sum);
        double probPositive = 1 / (1 + exponent);
        // Clamp the probability away from 0 and 1 to avoid log(0) later.
        if (probPositive == 0)
            probPositive = 0.00001;
        else if (probPositive == 1)
            probPositive = 0.9999;
        return probPositive;
    }

    private void adjustWeights(double[] x, double probPositive, double label) {
        // Stochastic gradient ascent update: alpha_j += lambda * (y - h(x)) * x_j
        this.alpha[0] += this.lambda * (label - probPositive); // intercept term
        for (int i = 1; i < this.alpha.length; i++) {
            this.alpha[i] += this.lambda * x[i - 1] * (label - probPositive);
        }
    }

    private double calculateLogL(ArrayList<BinaryInstance> allData) {

        double logl = 0;
        for (BinaryInstance instance : allData) {
            double probPositive = estimateProbs(instance.getX());
            double label = instance.getLabel() ? 1 : 0;
            double probNegative = 1 - probPositive;
            logl += label * Math.log(probPositive) + (1 - label)
                    * Math.log(probNegative);
        }
        return logl;
    }
}