sPLS-DA: srbct

Variable representation

We can represent the genes selected with sPLS-DA on correlation circles. See also this page for more details about correlation circles (2D and 3D plots).

plotVar(result, comp = 1:2, var.label = srbct$gene.name[, 1], cex = 0.5)

Here we have chosen to display the gene IDs instead of points. The variables can also be displayed in 3D:

plot3dVar(result, comp = 1:3, X.label = srbct$gene.name[, 1], cex = 0.5,
          axes.box = "axes")

The gray sphere has a radius of one.

Estimation of the classification error rate

With sPLS-DA it is possible to estimate the classification error rate using cross-validation. Below is a home-made script to estimate such an error rate with respect to the number of selected variables. In the predict function, several methods are proposed to predict the class of a sample: "max.dist", "centroids.dist" and "mahalanobis.dist".
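Before turning to that script, here is a minimal sketch of a direct call to predict on left-out samples. The train/test split is made up for the example, and the name of the distance argument ("method") and the layout of the returned class slot are assumptions about this version of mixOmics; they may differ in other releases.

## Hypothetical split of the samples into a training and a test set
set.seed(123)
train <- sample(1:nrow(X), 50)

## Fit sPLS-DA on the training samples only
fit <- splsda(X[train, ], Y[train], ncomp = 3, keepX = c(10, 10, 10))

## Predict the class of the left-out samples with the maximum distance
pred <- predict(fit, X[-train, ], method = "max.dist")

## Confusion matrix based on the predictions using the first 3 components
table(pred$class$max.dist[, 3], Y[-train])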

## Set number of components to 3
ncomp <- 3

## Grid for the total number of genes selected over all ncomp dimensions,
## converted into a per-component keepX
keepX <- round(c(seq(5, 45, 5), seq(50, 500, 50))/ncomp)
error <- matrix(NA, nrow = length(keepX), ncol = 3) 

for (i in 1:length(keepX)) {
    error[i, ] <- valid(X, Y, ncomp = 3, keepX = rep(keepX[i], ncomp),
                        pred.method = "max.dist", method = "splsda",
                        validation = "Mfold", M = 10)
}                   

## Plot the error obtained for each dimension
matplot(error, type = 'l', axes = FALSE, xlab = 'number of selected genes',
        ylab = 'error rate', col = c("black", "red", "blue"), lwd = 2, lty = 1)
axis(1, c(1:length(keepX)), labels = keepX)
axis(2)
legend(6, 0.45, lty = 1, legend = c('dim 1', 'dim 1:2', 'dim 1:3'),
       horiz = TRUE, cex = 0.9, col = c("black", "red", "blue"), lwd = 2)

We can see on this graph that the error rate decreases once a sufficient number of dimensions is included. In the sPLS-DA case, ncomp should be set to k - 1, where k is the number of classes. To obtain a more reliable estimation of the error rate, the computations above should be repeated several times and then averaged.
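As a rough sketch of this repetition (reusing the same valid call as above; the number of repeats is arbitrary and chosen here only for illustration):

## Repeat the M-fold cross-validation several times and average the error rates
nrepeat <- 10
error.rep <- array(NA, dim = c(length(keepX), 3, nrepeat))

for (r in 1:nrepeat) {
    for (i in 1:length(keepX)) {
        error.rep[i, , r] <- valid(X, Y, ncomp = 3, keepX = rep(keepX[i], ncomp),
                                    pred.method = "max.dist", method = "splsda",
                                    validation = "Mfold", M = 10)
    }
}

## Averaged error rates over the repeats, same layout as 'error' above
error.mean <- apply(error.rep, c(1, 2), mean)

The resulting error.mean matrix can then be passed to the same matplot call as above.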

This type of graph may also help to choose the ‘optimal’ number of variables to select, as well as the number of dimensions ncomp.
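For instance, one simple heuristic (our own convention, not part of the package) is to retain the smallest keepX whose averaged error rate on the last dimension lies within one percentage point of the minimum:

## Smallest keepX within 0.01 of the minimum error on the 3rd dimension
tol <- 0.01
best <- which(error.mean[, 3] <= min(error.mean[, 3]) + tol)[1]
keepX[best]    ## candidate number of genes to select per component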

References

  • Khan et al. (2001) Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine 7(6).

About sPLS-DA and its performance compared to other approaches:

  • Lê Cao K.-A., Boitard S. and Besse P. (2011) Sparse PLS Discriminant Analysis: biologically relevant feature selection and graphical displays for multiclass problems. BMC Bioinformatics 12:253.